Abstract

We give sublinear-time approximation algorithms for some optimization problems arising
in machine learning, such as training linear classifiers and finding minimum enclosing balls.
Our algorithms can be extended to some kernelized versions of these problems, such as SVDD,
hard margin SVM, and L2-SVM, for which sublinear-time algorithms were not known before.
These new algorithms use a combination of novel sampling techniques and a new multiplicative
update algorithm. We give lower bounds which show the running times of many of our algorithms
to be nearly best possible in the unit-cost RAM model. We also give implementations of our
algorithms in the semi-streaming setting, obtaining the first low-pass, polylogarithmic-space,
sublinear-time algorithms achieving an arbitrary approximation factor.


1 Introduction

Linear classification is a fundamental problem of machine learning, in which positive and negative
examples of a concept are represented in Euclidean space by their feature vectors, and we seek to
find a hyperplane separating the two classes of vectors.

The Perceptron Algorithm for linear classification is one of the oldest algorithms studied in
machine learning [Nov62, MP88]. It can be used to efficiently give a good approximate solution,
if one exists, and has nice noise-stability properties which allow it to be used as a subroutine in
many applications such as learning with noise [Byl94, BFKV98], boosting [Ser99] and more general
optimization [DV04]. In addition, it is extremely simple to implement: the algorithm starts with
an arbitrary hyperplane, and iteratively finds a vector on which it errs, and moves in the direction
of this vector by adding a multiple of it to the normal vector to the current hyperplane.
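As a point of reference for the classical procedure just described, the following is a minimal Python sketch. The names `dot`, `perceptron`, `examples`, `labels`, and `max_iters` are illustrative assumptions, not notation from the paper, and the step size (adding the misclassified example itself) is one common choice among several.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(examples, labels, max_iters=1000):
    """Classical perceptron sketch: returns a normal vector separating the
    data, or None if no separator is found within max_iters updates."""
    normal = [0.0] * len(examples[0])  # an arbitrary starting hyperplane
    for _ in range(max_iters):
        # Scan for a "bad vector": one whose inner product has the wrong sign.
        bad = next(((a, y) for a, y in zip(examples, labels)
                    if y * dot(a, normal) <= 0), None)
        if bad is None:
            return normal  # every example is classified correctly
        a, y = bad
        # Move in the direction of the misclassified example.
        normal = [w + y * ai for w, ai in zip(normal, a)]
    return None
```

Each pass over the data to find a bad vector costs time proportional to the input size, which is the cost the sampling-based approach of this paper avoids.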

The standard implementation of the Perceptron Algorithm must iteratively find a "bad vector" 
which is classified incorrectly, that is, for which the inner product with the current normal vector has 
an incorrect sign. Our new algorithm is similar to the Perceptron Algorithm, in that it maintains a 
hyperplane and modifies it iteratively, according to the examples seen. However, instead of explicitly 
finding a bad vector, we run another dual learning algorithm to learn the "most adversarial" 
distribution over the vectors, and use that distribution to generate an "expected bad" vector. 
Moreover, we do not compute the inner products with the current normal vector exactly, but 
instead estimate them using a fast sampling-based scheme. 

Thus our update to the hyperplane uses a vector whose "badness" is determined quickly, but
very crudely. We show that despite this, an approximate solution is still obtained in about the
same number of iterations as the standard perceptron. So our algorithm is faster; notably, it can
be executed in time sublinear in the size of the input data, and still have good output, with high
probability. (Here we must make some reasonable assumptions about the way in which the data is
stored, as discussed below.)

Figure 1: Our results, except for semi-streaming and parallel algorithms.

This technique applies more generally than to the perceptron: we also obtain sublinear time 
approximation algorithms for the related problems of finding an approximate Minimum Enclosing 
Ball (MEB) of a set of points, and training a Support Vector Machine (SVM), in the hard margin 
or L2-SVM formulations. 

We give lower bounds that imply that our algorithms for classification are best possible, up to 
polylogarithmic factors, in the unit-cost RAM model, while our bounds for MEB are best possible 
up to an $\tilde{O}(\varepsilon^{-1})$ factor. For most of these bounds, we give a family of inputs such that a single
coordinate, randomly "planted" over a large collection of input vector coordinates, determines the 
output to such a degree that all coordinates in the collection must be examined for even a 2/3 
probability of success. 

We show that our algorithms can be implemented in the parallel setting, and in the
semi-streaming setting; for the latter, we need a careful analysis of arithmetic precision requirements
and an implementation of our primal-dual algorithms using lazy updates, as well as some recent
sampling technology [MW10].

Our approach can be extended to give algorithms for the kernelized versions of these problems, 
for some popular kernels including the Gaussian and polynomial, and also easily gives Las Vegas 
results, where the output guarantees always hold, and only the running time is probabilistic. Our
approach also applies to the case of soft margin SVM (joint work in progress with Nati Srebro). 

Our main results, except for semi-streaming and parallel algorithms, are given in Figure 1. The
notation is as follows. All the problems we consider have an $n \times d$ matrix $A$ as input, with $M$
nonzero entries, and with each row of $A$ with Euclidean length no more than one. The parameter
$\varepsilon > 0$ is the additive error; for MEB, this can be a relative error, after a simple $O(M)$ preprocessing
step. We use the asymptotic notation $\tilde{O}(f) = O(f \cdot \mathrm{polylog}(nd/\varepsilon))$. The parameter $\sigma$ is the margin of
the problem instance, explained below. The parameters $s$ and $q$ determine the standard deviation
of a Gaussian kernel, and degree of a polynomial kernel, respectively.

The time bounds given for our algorithms, except the Las Vegas ones, are under the assumption
of constant error probability; for output guarantees that hold with probability $1 - \delta$, our bounds
should be multiplied by $\log(n/\delta)$.

1 For MEB and the kernelized versions, we assume that the Euclidean norms of the relevant input vectors are known.
Even with the addition of this linear-time step, all our algorithms improve on prior bounds, with the exception of
MEB when $M = o(\varepsilon^{-3/2}(n + d))$.

The time bounds also require the assumption that the input data is stored in such a way that
a given entry $A_{ij}$ can be recovered in constant time. This can be done by, for example, keeping
each row $A_i$ of $A$ as a hash table. (Simply keeping the entries of the row in sorted order by column
number is also sufficient, incurring an $O(\log d)$ overhead in running time for binary search.)
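The two storage options for a sparse row can be sketched as follows. This is only an illustration under assumed names (`make_hash_row`, `hash_lookup`, `make_sorted_row`, `sorted_lookup`, and the `(column, value)` entry format are all hypothetical); Python's `dict` gives expected constant-time lookup, and `bisect_left` gives the binary search over sorted column indices.

```python
from bisect import bisect_left


def make_hash_row(entries):
    """Store the nonzero (column, value) pairs of one row in a hash table."""
    return dict(entries)


def hash_lookup(row, j):
    """Expected O(1) access to entry j; absent columns are zero."""
    return row.get(j, 0.0)


def make_sorted_row(entries):
    """Store the nonzero pairs as parallel lists sorted by column index."""
    pairs = sorted(entries)
    return [c for c, _ in pairs], [v for _, v in pairs]


def sorted_lookup(row, j):
    """O(log d) access to entry j by binary search over the column indices."""
    cols, vals = row
    k = bisect_left(cols, j)
    return vals[k] if k < len(cols) and cols[k] == j else 0.0
```

Either layout uses space proportional to the number of nonzero entries of the row.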

By appropriately modifying our algorithms, we obtain algorithms with very low pass, space, and
time complexity. Many problems cannot be well-approximated in one pass, so a model permitting a
small number of passes over the data, called the semi-streaming model, has gained recent attention
[FKM+08, Mut05]. In this model the data is explicitly stored, and the few passes over it result in
low I/O overhead. It is quite suitable for problems such as MEB, for which any algorithm using a
single pass and sublinear (in $n$) space cannot approximate the optimum value to within better than
a fixed constant [AS10]. Unlike traditional semi-streaming algorithms, we also want our algorithms
to be sublinear time, so that in each pass only a small portion of the input is read.

We assume we see the points (input rows) one at a time in an arbitrary order. The space
is measured in bits. For MEB, we obtain an algorithm with $\tilde{O}(\varepsilon^{-1})$ passes, $\tilde{O}(\varepsilon^{-2})$ space, and
$\tilde{O}(\varepsilon^{-3}(n + d))$ total time. For linear classification, we obtain an algorithm with $\tilde{O}(\varepsilon^{-2})$ passes,
$\tilde{O}(\varepsilon^{-2})$ space, and $\tilde{O}(\varepsilon^{-4}(n + d))$ total time. For comparison, prior streaming algorithms for these
problems [AS10, ZZC06] require a prohibitive $\Omega(d)$ space, and none achieved a sublinear $o(nd)$
amount of time. Further, their guarantee is an approximation up to a fixed constant, rather than
for a general $\varepsilon$ (though they can achieve a single pass).

Formal Description: Classification. In the linear classification problem, the learner is given
a set of $n$ labeled examples in the form of $d$-dimensional vectors, comprising the input matrix $A$.
The labels comprise a vector $y \in \{+1, -1\}^n$.

The goal is to find a separating hyperplane, that is, a normal vector $x$ in the unit Euclidean ball
$\mathbb{B}$ such that for all $i$, $y(i) \cdot A_i x > 0$; here $y(i)$ denotes the $i$'th coordinate of $y$. As mentioned, we
will assume throughout that $A_i \in \mathbb{B}$ for all $i \in [n]$, where generally $[m]$ denotes the set of integers
$\{1, 2, \ldots, m\}$.

As is standard, we may assume that the labels $y(i)$ are all 1, by taking $A_i \leftarrow -A_i$ for any $i$ with
$y(i) = -1$. The approximation version of linear classification (which is necessary in case there is
noise), is to find a vector $x_\varepsilon \in \mathbb{B}$ that is an $\varepsilon$-approximate solution, that is,

$$ \forall i' \quad A_{i'} x_\varepsilon \ge \max_{x \in \mathbb{B}} \min_i A_i x - \varepsilon. \tag{1} $$

The optimum for this formulation is obtained when $\|x\| = 1$, except when no separating hyperplane
exists, and then the optimum $x$ is the zero vector.

Note that $\min_i A_i x = \min_{p \in \Delta} p^T A x$, where $\Delta \subset \mathbb{R}^n$ is the unit simplex $\{p \in \mathbb{R}^n \mid p_i \ge 0, \sum_i p_i = 1\}$.
Thus we can regard the optimum as the outcome of a game to determine $p^T A x$, between a
minimizer choosing $p \in \Delta$, and a maximizer choosing $x \in \mathbb{B}$, yielding

$$ \sigma \equiv \max_{x \in \mathbb{B}} \min_{p \in \Delta} p^T A x, $$

where this optimum $\sigma$ is called the margin. From standard duality results, $\sigma$ is also the optimum
of the dual problem

$$ \min_{p \in \Delta} \max_{x \in \mathbb{B}} p^T A x, $$

and the optimum vectors $p^*$ and $x^*$ are the same for both problems.

The classical Perceptron Algorithm returns an $\varepsilon$-approximate solution to this problem in
$O(\varepsilon^{-2})$ iterations, and total time $O(\varepsilon^{-2} M)$.

For given $\delta \in (0, 1)$, our new algorithm takes $O(\varepsilon^{-2}(n + d)(\log n) \log(n/\delta))$ time to return an
$\varepsilon$-approximate solution with probability at least $1 - \delta$. Further, we show this is optimal in the
unit-cost RAM model, up to poly-logarithmic factors.

Formal Description: Minimum Enclosing Ball (MEB). The MEB problem is to find the
smallest Euclidean ball in $\mathbb{R}^d$ containing the rows of $A$. It is a special case of quadratic programming
(QP) in the unit simplex, namely, to find $\min_{p \in \Delta} p^T b + p^T A A^T p$, where $b$ is an $n$-vector. This
relationship, and the generalization of our MEB algorithm to QP in the simplex, is discussed in
§3.3; for more general background on QP in the simplex, and related problems, see for example

PUS]. 

1.1 Related work 

Perhaps the most closely related work is that of Grigoriadis and Khachiyan [GK95], who showed
how to approximately solve a zero-sum game up to additive precision $\varepsilon$ in time $\tilde{O}(\varepsilon^{-2}(n + d))$,
where the game matrix is $n \times d$. This problem is analogous to ours, and our algorithm is similar in
structure to theirs, but where we minimize over $p \in \Delta$ and maximize over $x \in \mathbb{B}$, their optimization
has not only $p$ but also $x$ in a unit simplex.

Their algorithm (and ours) relies on sampling based on $x$ and $p$, to estimate inner products $x^T v$
or $p^T w$ for vectors $v$ and $w$ that are rows or columns of $A$. For a vector $p \in \Delta$, this estimation is
easily done by returning $w_i$ with probability $p_i$.

For vectors $x \in \mathbb{B}$, however, the natural estimation technique is to pick $i$ with probability $x_i^2$,
and return $v_i/x_i$. The estimator from this $\ell_2$ sample is less well-behaved, since it is unbounded,
and can have a high variance. While $\ell_2$ sampling has been used in streaming applications [MW10],
it has not previously found applications in optimization due to this high variance problem.
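The two estimators can be illustrated with a short Python sketch. The function names and the explicit normalization by $\|x\|^2$ (which makes the $\ell_2$ estimator unbiased even when $x$ is not a unit vector, as in Algorithm 1 later) are choices made here for illustration. The helper `l2_exact_moments` is a hypothetical checking aid that enumerates all outcomes to show the mean equals $x^T v$ and the second moment is at most $\|x\|^2 \|v\|^2$.

```python
import random


def l1_sample(p, w, rng):
    """Return w[i] with probability p[i]; for p in the simplex the
    expectation of the returned value is the inner product of p and w."""
    i = rng.choices(range(len(p)), weights=p)[0]
    return w[i]


def l2_sample(x, v, rng):
    """Pick i with probability x[i]^2 / ||x||^2 and return
    v[i] * ||x||^2 / x[i]; the expectation is the inner product of x and v.
    The value is unbounded as x[i] approaches zero."""
    squares = [xi * xi for xi in x]
    norm_sq = sum(squares)
    i = rng.choices(range(len(x)), weights=squares)[0]
    return v[i] * norm_sq / x[i]


def l2_exact_moments(x, v):
    """Exact mean and second moment of l2_sample, by enumerating outcomes."""
    norm_sq = sum(xi * xi for xi in x)
    mean = second = 0.0
    for xi, vi in zip(x, v):
        if xi == 0:
            continue  # coordinates with zero weight are never sampled
        prob = xi * xi / norm_sq
        estimate = vi * norm_sq / xi
        mean += prob * estimate
        second += prob * estimate * estimate
    return mean, second
```

A single $\ell_2$ sample touches one coordinate of $v$, which is what makes sublinear-time estimation of $A_i x$ possible.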

Indeed, it might seem surprising that sublinearity is at all possible, given that the correct 
classifier might be determined by very few examples, as shown in Figure 2. It thus seems necessary
to go over all examples at least once, instead of looking at noisy estimates based on sampling. 



Figure 2: The optimum x* is determined by the vectors near the horizontal axis. 

However, as we show, in our setting there is a version of the fundamental Multiplicative Weights
(MW) technique that can cope with unbounded updates, and for which the variance of $\ell_2$-sampling
is manageable. In our version of MW, the multiplier associated with a value $z$ is quadratic in $z$, in
contrast to the more standard multiplier that is exponential in $z$; while the latter is a fundamental
building block in approximate optimization algorithms, as discussed by Plotkin et al. [PST91], in
our setting such exponential updates can lead to a very expensive $d^{\Omega(1)}$ iterations.

We analyze MW from the perspective of on-line optimization, and show that our version of MW
has low expected regret given only that the random updates have the variance bounds
provable for $\ell_2$ sampling. We also use another technique from on-line optimization, a gradient
descent variant which is better suited for the ball.

For the special case of zero-sum games in which the entries are all non-negative (this is equivalent
to packing and covering linear programs), Koufogiannakis and Young [KY07] give a
sublinear-time algorithm which returns a relative approximation in time $\tilde{O}(\varepsilon^{-2}(n + d))$. Our lower bounds
show that a similar relative approximation bound for sublinear algorithms is impossible for general
classification, and hence general linear programming.



2 Linear Classification and the Perceptron 

Before our algorithm, some reminders and further notation: $\Delta \subset \mathbb{R}^n$ is the unit simplex
$\{p \in \mathbb{R}^n \mid p_i \ge 0, \sum_i p_i = 1\}$, $\mathbb{B} \subset \mathbb{R}^d$ is the Euclidean unit ball, and the unsubscripted $\|x\|$ denotes the
Euclidean norm $\|x\|_2$. The $n$-vector, all of whose entries are one, is denoted by $\mathbf{1}_n$.

The $i$'th row of the input matrix $A$ is denoted $A_i$, although a vector is a column vector unless
otherwise indicated. The $i$'th coordinate of vector $v$ is denoted $v(i)$. For a vector $v$, we let $v^2$
denote the vector whose coordinates have $v^2(i) \equiv v(i)^2$ for all $i$.



2.1 The Sublinear Perceptron 

Our sublinear perceptron algorithm is given as Algorithm 1. The algorithm maintains a vector $w_t \in \mathbb{R}^n$,
with nonnegative coordinates, and also $p_t \in \Delta$, which is $w_t$ scaled to have unit $\ell_1$ norm. A vector
$y_t \in \mathbb{R}^d$ is maintained also, and $x_t$, which is $y_t$ scaled to have Euclidean norm no larger than one.
These normalizations are done on line 4.

In lines 5 and 6, the algorithm is updating $y_t$ by adding a row of $A$ randomly chosen using $p_t$.
This is a randomized version of Online Gradient Descent (OGD); due to the random choice of $i_t$,
$A_{i_t}$ is an unbiased estimator of $p_t^T A$, which is the gradient of $p_t^T A y$ with respect to $y$.

In lines 7 through 12, the algorithm is updating $w_t$ using a column $j_t$ of $A$ randomly chosen
based on $x_t$, and also using the value $x_t(j_t)$. This is a version of the Multiplicative Weights (MW)
technique for online optimization in the unit simplex, where $v_t$ is an unbiased estimator of $A x_t$, the
gradient of $p^T A x_t$ with respect to $p$.

Actually, $v_t$ is not unbiased, after the clip operation: for $z, V \in \mathbb{R}$, $\mathrm{clip}(z, V) \equiv \min\{V, \max\{-V, z\}\}$,
and our analysis is helped by clipping the entries of $v_t$; we show that the resulting slight bias is not
harmful.

As discussed in §1.1, the sampling used to choose $j_t$ (and update $p_t$) is $\ell_2$-sampling, and that
for $i_t$, $\ell_1$-sampling. These techniques, which can be regarded as special cases of an $\ell_p$-sampling
technique, for $p \in [1, \infty)$, yield unbiased estimators of vector dot products. It is important for us
also that $\ell_2$-sampling has a variance bound here; in particular, for each relevant $i$ and $t$,

$$ \mathbf{E}[v_t(i)^2] \le \|A_i\|^2 \|x_t\|^2 \le 1. \tag{2} $$

Algorithm 1 Sublinear Perceptron

1: Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n \times d}$ with $A_i \in \mathbb{B}$ for $i \in [n]$.
2: Let $T \leftarrow 200^2 \varepsilon^{-2} \log n$, $y_1 \leftarrow 0$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$, $x_t \leftarrow \frac{y_t}{\max\{1, \|y_t\|\}}$
5:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with prob. $p_t(i)$.
6:   $y_{t+1} \leftarrow y_t + \frac{1}{\sqrt{2T}} A_{i_t}$
7:   Choose $j_t \in [d]$ by $j_t \leftarrow j$ with probability $x_t(j)^2/\|x_t\|^2$.
8:   for $i \in [n]$ do
9:     $\tilde{v}_t(i) \leftarrow A_i(j_t)\|x_t\|^2/x_t(j_t)$
10:    $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
11:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
12:  end for
13: end for
14: return $\bar{x} = \frac{1}{T}\sum_t x_t$

First we note the running time.

Theorem 2.1. The sublinear perceptron takes $O(\varepsilon^{-2} \log n)$ iterations, with a total running time of
$O(\varepsilon^{-2}(n + d) \log n)$.

Proof. The algorithm iterates $T = O(\frac{\log n}{\varepsilon^2})$ times. Each iteration requires:

1. One $\ell_2$ sample per iterate, which takes $O(d)$ time using known data structures.

2. Sampling $i_t \in_R p_t$, which takes $O(n)$ time.

3. The update of $x_t$ and $p_t$, which takes $O(n + d)$ time.

The total running time is $O(\varepsilon^{-2}(n + d) \log n)$. □
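The per-iteration work of the sublinear perceptron can be sketched in plain Python as below. This is a hedged illustration rather than the authors' implementation: the step size $1/\sqrt{2T}$ and the constants in $T$ and $\eta$ follow the reading of Algorithm 1 given in this section, while the optional `T` and `seed` parameters, the dense list-of-rows input, the rescaling of `w` to `p` each round (a numerical safeguard that leaves every $p_t$ unchanged), and the treatment of $x_t = 0$ (the estimate of $A x_t$ is then exactly zero) are assumptions made here. It requires $n \ge 2$ so that $\log n > 0$.

```python
import math
import random


def clip(z, bound):
    """clip(z, V) = min(V, max(-V, z))."""
    return min(bound, max(-bound, z))


def sublinear_perceptron(A, eps, T=None, seed=0):
    """Sketch of the sublinear perceptron; A is a list of n >= 2 rows in the
    unit ball with labels already folded in (we want A_i . x > 0)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    if T is None:
        T = math.ceil(200 ** 2 * eps ** -2 * math.log(n))
    eta = math.sqrt(math.log(n) / T) / 100
    step = 1 / math.sqrt(2 * T)
    y = [0.0] * d
    w = [1.0] * n
    x_sum = [0.0] * d
    for _ in range(T):
        # Normalizations: p_t in the simplex, x_t in the unit ball.
        total = sum(w)
        p = [wi / total for wi in w]
        scale = max(1.0, math.sqrt(sum(c * c for c in y)))
        x = [c / scale for c in y]
        x_sum = [s + c for s, c in zip(x_sum, x)]
        # Primal (OGD) step: add a row sampled from p_t.
        i_t = rng.choices(range(n), weights=p)[0]
        y = [c + step * a for c, a in zip(y, A[i_t])]
        # Dual (MW) step: l2-sample one coordinate of x_t, clip, update.
        w = p  # rescaling only; the induced distributions are unchanged
        norm_sq = sum(c * c for c in x)
        if norm_sq > 0:
            j_t = rng.choices(range(d), weights=[c * c for c in x])[0]
            new_w = []
            for i in range(n):
                v = clip(A[i][j_t] * norm_sq / x[j_t], 1 / eta)
                new_w.append(w[i] * (1 - eta * v + eta * eta * v * v))
            w = new_w
    return [s / T for s in x_sum]
```

Each iteration reads one row and one column of $A$, so the work per iteration is $O(n + d)$, matching the accounting in the proof of Theorem 2.1.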

Next we analyze the output quality. The proof uses new tools from regret minimization and 
sampling that are the building blocks of most of our upper bound results. 
Let us first state the MW algorithm used in all our algorithms. 

Definition 2.2 (MW algorithm). Consider a sequence of vectors $q_1, \ldots, q_T \in \mathbb{R}^n$. The Multiplicative
Weights (MW) algorithm is as follows. Let $w_1 \leftarrow \mathbf{1}_n$, and for $t \ge 1$,

$$ p_t \leftarrow w_t/\|w_t\|_1, \tag{3} $$

and for $0 < \eta \in \mathbb{R}$,

$$ w_{t+1}(i) \leftarrow w_t(i)(1 - \eta q_t(i) + \eta^2 q_t(i)^2). \tag{4} $$

The following is a key lemma, which proves a novel bound on the regret of the MW algorithm 
above, suitable for the case where the losses are random variables with bounded variance. This is 
proven below, after a concentration lemma, and the main theorem and its proof. 

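Lemma 2.3 (Variance MW Lemma). The MW algorithm satisfies

$$ \sum_{t \in [T]} p_t^T q_t \le \min_{i \in [n]} \sum_{t \in [T]} \max\{q_t(i), -1/\eta\} + \frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T q_t^2. $$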
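The MW update with a quadratic multiplier, and the two sides of the regret bound of Lemma 2.3, can be evaluated numerically with the sketch below. The names `variance_mw` and `regret_bound_sides` are hypothetical; the bound is read as $\sum_t p_t^T q_t \le \min_i \sum_t \max\{q_t(i), -1/\eta\} + (\log n)/\eta + \eta \sum_t p_t^T q_t^2$, which is the form the proof of the lemma arrives at.

```python
import math


def variance_mw(qs, eta):
    """Run the MW update w(i) <- w(i) * (1 - eta*q(i) + eta^2*q(i)^2) on the
    loss vectors qs and return the distributions p_1, ..., p_T."""
    w = [1.0] * len(qs[0])
    ps = []
    for q in qs:
        total = sum(w)
        ps.append([wi / total for wi in w])
        w = [wi * (1 - eta * qi + (eta * qi) ** 2) for wi, qi in zip(w, q)]
    return ps


def regret_bound_sides(qs, eta):
    """Return (lhs, rhs) of the regret inequality for the sequence qs."""
    ps = variance_mw(qs, eta)
    n = len(qs[0])
    lhs = sum(sum(pi * qi for pi, qi in zip(p, q)) for p, q in zip(ps, qs))
    best = min(sum(max(q[i], -1 / eta) for q in qs) for i in range(n))
    second = sum(sum(pi * qi * qi for pi, qi in zip(p, q))
                 for p, q in zip(ps, qs))
    return lhs, best + math.log(n) / eta + eta * second
```

The multiplier $1 - z + z^2$ is at least $3/4$ for every real $z$, so the weights stay positive even when the losses are unbounded; this is the property that lets the analysis rely on variance bounds rather than on bounded losses.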

The following three lemmas give concentration bounds on our random variables from their 
expectations. The first two are based on standard martingale analysis, and the last is a simple 
Markov application. The proofs are deferred to Appendix B.

Lemma 2.4. For $\eta \le \sqrt{\frac{\log n}{10T}}$, with probability at least $1 - O(1/n)$,

$$ \max_i \sum_{t \in [T]} [v_t(i) - A_i x_t] \le 90\eta T. $$

Lemma 2.5. For $\eta \le \sqrt{\frac{\log n}{10T}}$, with probability at least $1 - O(1/n)$, it holds that
$\left| \sum_{t \in [T]} A_{i_t} x_t - \sum_t p_t^T v_t \right| \le 100\eta T$.

Lemma 2.6. With probability at least $1 - \frac{1}{4}$, it holds that $\sum_t p_t^T v_t^2 \le 8T$.

Theorem 2.7 (Main Theorem). With probability 1/2, the sublinear perceptron returns a solution
$\bar{x}$ that is an $\varepsilon$-approximation.

Proof. First we use the regret bounds for lazy gradient descent to lower bound $\sum_{t \in [T]} A_{i_t} x_t$; next
we get an upper bound for that quantity using the Weak Regret lemma above, and then we combine
the two.

By definition, $A_i x^* \ge \sigma$ for all $i \in [n]$, and so, using the bound of Lemma A.2,

$$ T\sigma \le \max_{x \in \mathbb{B}} \sum_{t \in [T]} A_{i_t} x \le \sum_{t \in [T]} A_{i_t} x_t + 2\sqrt{2T}, \tag{5} $$

or rearranging,

$$ \sum_{t \in [T]} A_{i_t} x_t \ge T\sigma - 2\sqrt{2T}. \tag{6} $$



Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the
clipping of $v_t(i)$,

$$ \sum_{t \in [T]} p_t^T v_t \le \min_{i \in [n]} \sum_{t \in [T]} v_t(i) + \frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2. $$

By Lemma 2.4 above, with high probability, for any $i \in [n]$,

$$ \sum_{t \in [T]} A_i x_t \ge \sum_{t \in [T]} v_t(i) - 90\eta T, $$

so that with high probability

$$ \sum_{t \in [T]} p_t^T v_t \le \min_{i \in [n]} \sum_{t \in [T]} A_i x_t + \frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + 90\eta T. \tag{7} $$

Combining (6) and (7) we get

$$ \min_{i \in [n]} \sum_{t \in [T]} A_i x_t \ge -\frac{\log n}{\eta} - \eta \sum_{t \in [T]} p_t^T v_t^2 - 90\eta T + T\sigma - 2\sqrt{2T} - \left| \sum_{t \in [T]} p_t^T v_t - \sum_{t \in [T]} A_{i_t} x_t \right|. $$

By Lemmas 2.5 and 2.6 we have, w.p. at least $\frac{3}{4} - O(\frac{1}{n}) \ge \frac{1}{2}$,

$$ \min_{i \in [n]} \sum_{t \in [T]} A_i x_t \ge -\frac{\log n}{\eta} - 8\eta T - 90\eta T + T\sigma - 2\sqrt{2T} - 100\eta T \ge T\sigma - \frac{\log n}{\eta} - 200\eta T. $$

Dividing through by $T$, and using our choice of $\eta$, we have $\min_i A_i \bar{x} \ge \sigma - \varepsilon/2$ w.p. at least
1/2 as claimed. □



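Proof of Lemma 2.3, Weak Regret. We first show an upper bound on $\log \|w_{T+1}\|_1$, then a lower
bound, and then relate the two.
From (3) and (4) we have

$$
\begin{aligned}
\|w_{t+1}\|_1 &= \sum_{i \in [n]} w_{t+1}(i) \\
&= \sum_{i \in [n]} p_t(i) \|w_t\|_1 (1 - \eta q_t(i) + \eta^2 q_t(i)^2) \\
&= \|w_t\|_1 (1 - \eta p_t^T q_t + \eta^2 p_t^T q_t^2).
\end{aligned}
$$

This implies by induction on $t$, and using $1 + z \le \exp(z)$ for $z \in \mathbb{R}$, that

$$ \log \|w_{T+1}\|_1 = \log n + \sum_{t \in [T]} \log(1 - \eta p_t^T q_t + \eta^2 p_t^T q_t^2) \le \log n - \sum_{t \in [T]} \eta p_t^T q_t + \eta^2 \sum_{t \in [T]} p_t^T q_t^2. \tag{8} $$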

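Now for the lower bound. From (4) we have by induction on $t$ that

$$ w_{T+1}(i) = \prod_{t \in [T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2), $$

and so

$$
\begin{aligned}
\log \|w_{T+1}\|_1 &= \log \Big[ \sum_{i \in [n]} \prod_{t \in [T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2) \Big] \\
&\ge \log \Big[ \max_{i \in [n]} \prod_{t \in [T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2) \Big] \\
&= \max_{i \in [n]} \sum_{t \in [T]} \log(1 - \eta q_t(i) + \eta^2 q_t(i)^2) \\
&\ge \max_{i \in [n]} \sum_{t \in [T]} [\min\{-\eta q_t(i), 1\}],
\end{aligned}
$$

where the last inequality uses the fact that $1 + z + z^2 \ge \exp(\min\{z, 1\})$ for all $z \in \mathbb{R}$.
Putting this together with the upper bound (8), we have

$$ \max_{i \in [n]} \sum_{t \in [T]} [\min\{-\eta q_t(i), 1\}] \le \log n - \sum_{t \in [T]} \eta p_t^T q_t + \eta^2 \sum_{t \in [T]} p_t^T q_t^2. $$

Changing sides,

$$
\begin{aligned}
\sum_{t \in [T]} \eta p_t^T q_t &\le -\max_{i \in [n]} \sum_{t \in [T]} [\min\{-\eta q_t(i), 1\}] + \log n + \eta^2 \sum_{t \in [T]} p_t^T q_t^2 \\
&= \min_{i \in [n]} \sum_{t \in [T]} [\max\{\eta q_t(i), -1\}] + \log n + \eta^2 \sum_{t \in [T]} p_t^T q_t^2,
\end{aligned}
$$

and the lemma follows, dividing through by $\eta$. □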

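Corollary 2.8 (Dual solution). The vector $\bar{p} \equiv \sum_t e_{i_t}/T$ is, with probability 1/2, an
$O(\varepsilon)$-approximate dual solution.

Proof. Observing in (5) that the middle expression $\max_{x \in \mathbb{B}} \sum_{t \in [T]} A_{i_t} x$ is equal to $T \max_{x \in \mathbb{B}} \bar{p}^T A x$,
we have $T \max_{x \in \mathbb{B}} \bar{p}^T A x \le \sum_{t \in [T]} A_{i_t} x_t + 2\sqrt{2T}$, or changing sides,

$$ \sum_{t \in [T]} A_{i_t} x_t \ge T \max_{x \in \mathbb{B}} \bar{p}^T A x - 2\sqrt{2T}. $$

Recall from (7) that with high probability,

$$ \sum_{t \in [T]} p_t^T v_t \le \min_{i \in [n]} \sum_{t \in [T]} A_i x_t + \frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + 90T\eta. \tag{9} $$

Following the proof of the main Theorem, we combine both inequalities and use Lemmas 2.5 and 2.6
such that with probability at least $\frac{1}{2}$:

$$
\begin{aligned}
T \max_{x \in \mathbb{B}} \bar{p}^T A x &\le \min_{i \in [n]} \sum_{t \in [T]} A_i x_t + \frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + 90T\eta + 2\sqrt{2T} + \left| \sum_{t \in [T]} p_t^T v_t - \sum_{t \in [T]} A_{i_t} x_t \right| \\
&\le T\sigma + O(\sqrt{T \log n}).
\end{aligned}
$$

Dividing through by $T$ we have with probability at least $\frac{1}{2}$ that $\max_{x \in \mathbb{B}} \bar{p}^T A x \le \sigma + O(\varepsilon)$ for our
choice of $T$ and $\eta$. □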

2.2 High Success Probability and Las Vegas 

Given two vectors $u, v \in \mathbb{B}$, we have seen that a single $\ell_2$-sample is an unbiased estimator of their
inner product with variance at most one. Averaging $1/\varepsilon^2$ such samples reduces the variance to $\varepsilon^2$,
which reduces the standard deviation to $\varepsilon$. Repeating $O(\log\frac{1}{\delta})$ such estimates, and taking the
median, gives an estimator denoted $X_{\varepsilon,\delta}$, which satisfies, via a Chernoff bound:

$$ \Pr[|X_{\varepsilon,\delta} - v^T u| > \varepsilon] \le \delta. $$
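The median-of-averages construction can be sketched as follows. The function names and the exact repetition counts (`ceil(1/eps^2)` samples per average and `ceil(log(1/delta))` averages) are assumptions chosen for illustration; the text only fixes these quantities up to constant factors.

```python
import math
import random
import statistics


def l2_sample(x, v, rng):
    """One l2-sample estimate of the inner product of x and v."""
    squares = [xi * xi for xi in x]
    norm_sq = sum(squares)
    i = rng.choices(range(len(x)), weights=squares)[0]
    return v[i] * norm_sq / x[i]


def median_of_means(x, v, eps, delta, rng):
    """Average about 1/eps^2 samples to reach standard deviation about eps,
    then take the median of about log(1/delta) such averages."""
    per_mean = math.ceil(1 / eps ** 2)
    repeats = max(1, math.ceil(math.log(1 / delta)))
    means = []
    for _ in range(repeats):
        total = sum(l2_sample(x, v, rng) for _ in range(per_mean))
        means.append(total / per_mean)
    return statistics.median(means)
```

Taking the median rather than a single long average is what turns a constant success probability per average into failure probability $\delta$ with only a logarithmic number of repetitions.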

As an immediate corollary of this fact we obtain: 

Corollary 2.9. There exists a randomized algorithm that with probability $1 - \delta$, successfully
determines whether a given hyperplane with normal vector $x \in \mathbb{B}$, together with an instance of linear
classification and parameter $\sigma > 0$, is an $\varepsilon$-approximate solution. The algorithm runs in time
$O(d + \frac{n}{\varepsilon^2} \log\frac{n}{\delta})$.

Proof. Let $\delta' = \delta/n$. Generate the random variable $X_{\varepsilon,\delta'}$ for each inner product pair $(x, A_i)$, and
return true if and only if $X_{\varepsilon,\delta'} \ge \sigma - \varepsilon$ for each pair. By the observation above and taking union
bound over all $n$ inner products, with probability $1 - \delta$ the estimate $X_{\varepsilon,\delta'}$ was $\varepsilon$-accurate for all
inner-product pairs, and hence the algorithm returned a correct answer.

The running time includes preprocessing of $x$ in $O(d)$ time, and $n$ inner-product estimates, for a
total of $O(d + \frac{n}{\varepsilon^2} \log\frac{n}{\delta})$. □

Hence, we can amplify the success probability of Algorithm 1 to $1 - \delta$ for any $\delta > 0$, albeit
incurring additional poly-log factors in running time:

Corollary 2.10 (High probability). There exists a randomized algorithm that with probability $1 - \delta$
returns an $\varepsilon$-approximate solution to the linear classification problem, and runs in expected time
$O(\frac{n + d}{\varepsilon^2} \log\frac{n}{\delta})$.

Proof. Run Algorithm 1 for $\log_2\frac{1}{\delta}$ times to generate that many candidate solutions. By Theorem
2.7, at least one candidate solution is an $\varepsilon$-approximate solution with probability at least
$1 - 2^{-\log_2(1/\delta)} = 1 - \delta$.

For each candidate solution apply the verification procedure above with success probability
$1 - \delta^2 \ge 1 - \frac{\delta}{\log_2(1/\delta)}$, and all verifications will be correct again with probability at least $1 - \delta$. Hence,
both events hold with probability at least $1 - 2\delta$. The result follows after adjusting constants.

The worst-case running time comes to $O(\frac{n + d}{\varepsilon^2} \log\frac{n}{\delta} \log\frac{1}{\delta})$. However, we can generate the
candidate solutions and verify them one at a time, rather than all at once. The expected number of
candidates we need to generate is constant. □
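The "one at a time" strategy at the end of the proof is a generate-and-verify loop. A generic sketch, with hypothetical callables `generate` and `verify` standing in for a run of the sublinear perceptron and for the verification procedure of Corollary 2.9, and a hypothetical cap `max_tries`:

```python
def first_verified(generate, verify, max_tries=64):
    """Generate candidates one at a time and stop at the first one that
    passes verification; returns (candidate, number of tries used)."""
    for attempt in range(1, max_tries + 1):
        candidate = generate()
        if verify(candidate):
            return candidate, attempt
    return None, max_tries
```

Because each candidate succeeds with constant probability, the number of rounds is geometrically distributed, which is why the expected number of candidates is constant.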

It is also possible to obtain an algorithm that never errs: 

Corollary 2.11 (Las Vegas Version). After $O(\varepsilon^{-2} \log n)$ iterations, the sublinear perceptron returns
a solution that with probability 1/2 can be verified in $O(M)$ time to be $\varepsilon$-approximate. Thus with
expected $O(1)$ repetitions, and a total of expected $O(M + \varepsilon^{-2}(n + d) \log n)$ work, a verified
$\varepsilon$-approximate solution can be found.

Proof. We have

$$ \min_i A_i \bar{x} \le \sigma \le \|\bar{p}^T A\|, $$

and so if

$$ \min_i A_i \bar{x} \ge \|\bar{p}^T A\| - \varepsilon, \tag{10} $$

then $\bar{x}$ is an $\varepsilon$-approximate solution, and $\bar{x}$ will pass this test if it and $\bar{p}$ are $(\varepsilon/2)$-approximate
solutions, and the same for $\bar{p}$.

Thus, running the algorithm for a constant factor more iterations, so that with probability
1/2, $\bar{x}$ and $\bar{p}$ are both $(\varepsilon/2)$-approximate solutions, it can be verified that both are $\varepsilon$-approximate
solutions. □
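The exact test (10) compares the primal value $\min_i A_i \bar{x}$ with the dual value $\|\bar{p}^T A\|$. A minimal sketch follows, using a dense list-of-rows matrix for simplicity (so it runs in $O(nd)$ time rather than the $O(M)$ achievable with sparse storage); the function name `verified_eps_approx` is hypothetical.

```python
import math


def verified_eps_approx(A, x, p, eps):
    """Return True if min_i A_i x >= ||p^T A|| - eps, which certifies that
    x is an eps-approximate primal solution (and p an eps-approximate dual)."""
    n, d = len(A), len(A[0])
    primal = min(sum(a * c for a, c in zip(row, x)) for row in A)
    pA = [sum(p[i] * A[i][j] for i in range(n)) for j in range(d)]
    dual = math.sqrt(sum(c * c for c in pA))
    return primal >= dual - eps
```

For example, with rows $(1, 0)$ and $(0, 1)$, the pair $x = (1/\sqrt{2}, 1/\sqrt{2})$, $p = (1/2, 1/2)$ passes the test for any $\varepsilon > 0$, since both values equal $1/\sqrt{2}$.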

2.3 Further Optimizations 



The regret of OGD as given in Lemma A.2 is smaller than that of the dual strategy of random MW. We can
take advantage of this and improve the running time slightly, by replacing line 6 of the sublinear
algorithm with the line shown below.

With probability let $y_{t+1} \leftarrow y_t + A_{i_t}$ (else do nothing).

This has the effect of increasing the regret of the primal online algorithm by a $\log n$ factor, which
does not hurt the number of iterations required to converge, since the overall regret is dominated
by that of the MW algorithm.

Since the primal solution $x_t$ is not updated in every iteration, we improve the running time
slightly to

$$ O(\varepsilon^{-2} \log n \, (n + d/(\log 1/\varepsilon + \log\log n))). $$

We use this technique to greater effect for the MEB problem, where it is discussed in more detail.

2.4 Implications in the PAC model 

Consider the "separable" case of hyperplane learning, in which there exists a hyperplane classifying
all data points correctly. It is well known that the concept class of hyperplanes in $d$ dimensions
with margin $\sigma$ has effective dimension at most $\min\{d, \frac{1}{\sigma^2}\} + 1$. Consider the case in which the
margin is significant, i.e. $\frac{1}{\sigma^2} < d$. PAC learning theory implies that the number of examples needed
to attain generalization error of $\delta$ is $O(\frac{1}{\sigma^2 \delta})$.

Using the method of online to batch conversion (see [CBCG04]), and applying the online gradient
descent algorithm, it is possible to obtain $\delta$ generalization error in $O(\frac{d}{\sigma^2 \delta})$ time, by going over
the data once and performing a gradient step on each example.

Our algorithm improves upon this running time bound as follows: we use the sublinear
perceptron to compute a $\sigma/2$-approximation to the best hyperplane over the test data, where the
number of examples is taken to be $n = O(\frac{1}{\sigma^2 \delta})$ (in order to obtain $\delta$ generalization error). As
shown previously, the total running time amounts to $\tilde{O}(\frac{n + d}{\sigma^2}) = \tilde{O}(\frac{1}{\sigma^4 \delta} + \frac{d}{\sigma^2})$.

This improves upon standard methods by a factor of $\tilde{O}(\sigma^2 d)$, which is always an improvement
by our initial assumption on $\sigma$ and $d$.



3 Strongly convex problems: MEB and SVM 
3.1 Minimum Enclosing Ball 

In the Minimum Enclosing Ball problem the input consists of a matrix $A \in \mathbb{R}^{n \times d}$. The rows are
interpreted as vectors and the problem is to find a vector $x \in \mathbb{R}^d$ such that

$$ x^* \equiv \arg\min_{x \in \mathbb{R}^d} \max_{i \in [n]} \|x - A_i\|^2. $$

We further assume for this problem that all vectors $A_i$ have Euclidean norm at most one. Denote
by $\sigma = \max_{i \in [n]} \|x^* - A_i\|^2$ the squared radius of the optimal ball; we say that a solution is $\varepsilon$-approximate
if the ball it generates has squared radius at most $\sigma + \varepsilon$.

As in the case of linear classification, to obtain tight running time bounds we use a primal-dual
approach; the algorithm is given below.

(This is a "conceptual" version of the algorithm: in the analysis of the running time, we use
the fact that we can batch together the updates for $w_t$ over the iterations for which $x_t$ does not
change.)

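Theorem 3.1. Algorithm 2 runs in $O(\frac{\log n}{\varepsilon^2})$ iterations, with a total expected running time of

$$ \tilde{O}\left(\frac{n}{\varepsilon^2} + \frac{d}{\varepsilon}\right), $$

and with probability 1/2, returns an $\varepsilon$-approximate solution.

Algorithm 2 Sublinear Primal-Dual MEB

Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n \times d}$ with $A_i \in \mathbb{B}$ for $i \in [n]$ and $\|A_i\|$ known.
Let $T \leftarrow \Theta(\varepsilon^{-2} \log n)$, $y_1 \leftarrow 0$, $w_1 \leftarrow \mathbf{1}$, $\eta \leftarrow \sqrt{(\log n)/T}$, $\alpha \leftarrow \frac{\log T}{\sqrt{T \log n}}$.
for $t = 1$ to $T$ do
  $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$.
  Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
  With probability $\alpha$, update $y_{t+1} \leftarrow y_t + A_{i_t}$ (else do nothing).
  Choose $j_t \in [d]$ by $j_t \leftarrow j$ with probability $x_t(j)^2/\|x_t\|^2$.
  for $i \in [n]$ do
    $\tilde{v}_t(i) \leftarrow -2A_i(j_t)\|x_t\|^2/x_t(j_t) + \|A_i\|^2 + \|x_t\|^2$.
    $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), \frac{1}{\eta})$.
    $w_{t+1}(i) \leftarrow w_t(i)(1 + \eta v_t(i) + \eta^2 v_t(i)^2)$.
  end for
end for
return $\bar{x} = \frac{1}{T} \sum_t x_t$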



Proof. Except for the running time analysis, the proof of this theorem is very similar to that of
Theorem 2.7, where we take advantage of a tighter regret bound for strictly convex loss functions
in the case of MEB, for which the OGD algorithm with a learning rate of order $1/t$ is known to obtain a
tighter regret bound of $O(\log T)$ instead of $O(\sqrt{T})$. For presentation, we use asymptotic notation
rather than computing the exact constants (as done for the linear classification problem).

Let $f_t(x) = \|x - A_{i_t}\|^2$. Notice that $\arg\min_{x \in \mathbb{B}} \sum_{\tau=1}^{t} f_\tau(x) = \frac{1}{t} \sum_{\tau=1}^{t} A_{i_\tau}$. By Lemma A.5, applied with
$f_t(x) = \|x - A_{i_t}\|^2$, with $G \le 2$ and $H = 2$, and $x^*$ being the solution to the instance, we have

$$ \mathbf{E}_{\{c_t\}}\Big[\sum_t \|x_t - A_{i_t}\|^2\Big] \le \mathbf{E}_{\{c_t\}}\Big[\sum_t \|x^* - A_{i_t}\|^2\Big] + O\Big(\frac{1}{\alpha} \log T\Big) \le T\sigma + O\Big(\frac{1}{\alpha} \log T\Big), \tag{11} $$

where $\sigma$ is the squared MEB radius. Here the expectation is taken only over the random coin tosses
for updating $x_t$, denoted $c_t$, and holds for any outcome of the indices $i_t$ sampled from $p_t$ and the
coordinates $j_t$ used for the $\ell_2$ sampling.



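Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, using the
clipping of $v_t(i)$, and reversing inequalities to account for the change of sign, we have

$$ \sum_{t \in [T]} p_t^T v_t \ge \max_{i \in [n]} \sum_{t \in [T]} v_t(i) - O\Big(\frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2\Big). $$

Using Lemmas B.4 and B.5, with high probability

$$ \forall i \in [n]: \quad \sum_{t \in [T]} v_t(i) \ge \sum_{t \in [T]} \|x_t - A_i\|^2 - O(\eta T), $$

$$ \Big| \sum_{t \in [T]} \|x_t - A_{i_t}\|^2 - \sum_t p_t^T v_t \Big| = O(\eta T). $$

Plugging these two facts in the previous inequality we have w.h.p.

$$ \sum_{t \in [T]} \|x_t - A_{i_t}\|^2 \ge \max_{i \in [n]} \sum_{t \in [T]} \|x_t - A_i\|^2 - O\Big(\frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + T\eta\Big). $$

This holds w.h.p. over the random choices of $\{i_t, j_t\}$, and irrespective of the coin tosses $\{c_t\}$. Hence,
we can take expectations w.r.t. $\{c_t\}$, and obtain

$$ \mathbf{E}_{\{c_t\}}\Big[\sum_{t \in [T]} \|x_t - A_{i_t}\|^2\Big] \ge \mathbf{E}_{\{c_t\}}\Big[\max_{i \in [n]} \sum_{t \in [T]} \|x_t - A_i\|^2\Big] - O\Big(\frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + T\eta\Big). \tag{12} $$

Combining with equation (11), we obtain that w.h.p. over the random variables $\{i_t, j_t\}$

$$ T\sigma + O\Big(\frac{1}{\alpha} \log T\Big) \ge \mathbf{E}_{\{c_t\}}\Big[\max_{i \in [n]} \sum_{t \in [T]} \|x_t - A_i\|^2\Big] - O\Big(\frac{\log n}{\eta} + \eta \sum_{t \in [T]} p_t^T v_t^2 + T\eta\Big). $$

Rearranging and using Lemma B.8 we have w.p. at least $\frac{1}{2}$

$$ \mathbf{E}_{\{c_t\}}\Big[\max_{i \in [n]} \sum_{t \in [T]} \|x_t - A_i\|^2\Big] \le O\Big(T\sigma + \frac{\log T}{\alpha} + \frac{\log n}{\eta} + T\eta\Big). $$

Dividing through by $T$ and applying Jensen's inequality, we have

$$ \mathbf{E}\Big[\max_{i \in [n]} \|\bar{x} - A_i\|^2\Big] \le \frac{1}{T} \mathbf{E}\Big[\max_{i \in [n]} \sum_{t \in [T]} \|x_t - A_i\|^2\Big] \le O\Big(\sigma + \frac{\log T}{\alpha T} + \frac{\log n}{\eta T} + \eta\Big). $$

Optimizing over the values of $\alpha$, $\eta$, and $T$, this implies that the expected error is $O(\varepsilon)$, and so using
Markov's inequality, $\bar{x}$ is a $O(\varepsilon)$-approximate solution with probability at least 1/2.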

Running time. The algorithm above consists of $T = O(\frac{\log n}{\varepsilon^2})$ iterations. Naively, this would
result in the same running time as for linear classification. Yet notice that $x_t$ changes only an
expected $\alpha T$ times, and only then do we perform an $O(d)$ operation. The expected number of
iterations in which $x_t$ changes is $\alpha T \le 16 \varepsilon^{-1} \log T$, and so the running time is

$$ O\Big(\varepsilon^{-1} (\log T) \cdot d + \frac{\log n}{\varepsilon^2} \cdot n\Big) = \tilde{O}(\varepsilon^{-2} n + \varepsilon^{-1} d). $$

□

The following Corollary is a direct analogue of Corollary 2.8.

Corollary 3.2 (Dual solution). The vector $\bar{p} \equiv \sum_t e_{i_t}/T$ is, with probability 1/2, an
$O(\varepsilon)$-approximate dual solution.

3.2 High Success Probability and Las Vegas 

As for linear classification, we can amplify the success probability of Algorithm 2 to $1 - \delta$ for any
$\delta > 0$, albeit incurring additional poly-log factors in running time.

Corollary 3.3 (MEB high probability). There exists a randomized algorithm that with probability
$1 - \delta$ returns an $\varepsilon$-approximate solution to the MEB problem, and runs in expected time
$\tilde{O}(\frac{n}{\varepsilon^2} \log\frac{n}{\delta} + \frac{d}{\varepsilon} \log\frac{1}{\varepsilon})$. There is also a randomized algorithm that returns an $\varepsilon$-approximate solution in
$\tilde{O}(M + \frac{n}{\varepsilon^2} + \frac{d}{\varepsilon})$ time.






Proof. We can estimate the distance between two points in $\mathbb{B}$ in $O(\epsilon^{-2}\log(1/\delta))$ time, with error at most $\epsilon$ and failure probability at most $\delta$, using the dot product estimator described in §2.2. Therefore we can estimate the maximum distance of a given point to every input point in $O(n\epsilon^{-2}\log(n/\delta))$ time, with error at most $\epsilon$ and failure probability at most $\delta$. This distance is $\sigma - \epsilon$, where $\sigma$ is the optimal radius attainable, w.p. $1 - \delta$.

Because Algorithm 2 yields an $\epsilon$-dual solution with probability 1/2, we can use this solution to verify that the radius of any possible solution to the farthest point is at least $\sigma - \epsilon$.

So, to obtain a solution as described in the corollary statement, run Algorithm 2 and verify that it yields an $\epsilon$-approximation, using this approximate dual solution; with probability 1/2, this gives a verified $\epsilon$-approximation. Keep trying until this succeeds, in an expected 2 trials.

For a Las Vegas algorithm, we simply apply the same scheme, but verify the distances exactly. 

□ 



3.3 Convex Quadratic Programming in the Simplex 

We can extend our approach to problems of the form 

$$\min_{p\in\Delta}\; p^\top b + p^\top A A^\top p, \tag{13}$$

where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n\times d}$, and $\Delta$ is, as usual, the unit simplex in $\mathbb{R}^n$. As is well known, and as
we partially review below, this problem includes the MEB problem, margin estimation as for hard 
margin support vector machines, the L2-SVM variant of support vector machines, the problem of 
finding the shortest vector in a polytope, and others. 

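Applying $\|v - x\|^2 = v^\top v + x^\top x - 2v^\top x \ge 0$ with $v \leftarrow A^\top p$, we have

$$\max_{x\in\mathbb{R}^d}\; 2p^\top A x - \|x\|^2 = p^\top A A^\top p, \tag{14}$$

with equality at $x = A^\top p$. Thus (13) can be written as

$$\min_{p\in\Delta}\,\max_{x\in\mathbb{R}^d}\; p^\top\bigl(b + 2Ax - 1_n\|x\|^2\bigr). \tag{15}$$

The Wolfe dual of this problem exchanges the max and min:

$$\max_{x\in\mathbb{R}^d}\,\min_{p\in\Delta}\; p^\top\bigl(b + 2Ax - 1_n\|x\|^2\bigr). \tag{16}$$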



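Since

$$\min_{p\in\Delta}\; p^\top\bigl(b + 2Ax - 1_n\|x\|^2\bigr) = \min_{i}\; b(i) + 2A_i x - \|x\|^2, \tag{17}$$

with the minimum attained when $p_i = 0$ whenever $i$ is not a minimizer, the dual can also be expressed as

$$\max_{x\in\mathbb{R}^d}\,\min_{i}\; b(i) + 2A_i x - \|x\|^2. \tag{18}$$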
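As a small numerical illustration of relation (14) and of the fact that the dual objective (18) never exceeds the primal objective (13), the following sketch evaluates both on made-up data. The matrix `A`, vector `b`, weights `p`, and all function names are assumptions introduced only for this example.

```python
# Numerical check of relation (14) and of weak duality between (13) and (18).
# A, b, and p below are illustrative data, not taken from the paper.

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def at_p(A, p):
    """Return A^T p as a list of length d."""
    d = len(A[0])
    return [sum(p[i] * A[i][j] for i in range(len(A))) for j in range(d)]

def primal(A, b, p):
    """Objective (13): p^T b + p^T A A^T p."""
    x = at_p(A, p)
    return dot(p, b) + dot(x, x)

def inner(A, p, x):
    """Expression inside the max of (14): 2 p^T A x - ||x||^2."""
    return 2 * dot(at_p(A, p), x) - dot(x, x)

def dual(A, b, x):
    """Objective (18): min_i b(i) + 2 A_i x - ||x||^2."""
    return min(b[i] + 2 * dot(A[i], x) for i in range(len(A))) - dot(x, x)

A = [[0.6, 0.0], [0.0, 0.8], [0.3, 0.4]]
b = [0.1, -0.2, 0.0]
p = [0.5, 0.25, 0.25]          # a point of the unit simplex
x_star = at_p(A, p)            # the maximizer x = A^T p in (14)
print(inner(A, p, x_star), dot(x_star, x_star))
print(dual(A, b, x_star), primal(A, b, p))
```

The first printed pair should agree, and the dual value in the second pair should not exceed the primal value.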



By the two relations (14) and (17) used to derive the dual problem from the primal, we have 
immediately the weak duality condition that the objective function of the dual (18) is always no 
more than the objective function value of the primal (13). The strong duality condition, that the 
two problems take the same optimal value, also holds here; indeed, the optimum also solves (14), 
and the optimal $p^*$ also solves (17).

To generalize Algorithm 2 we make $v_t$ an unbiased estimator of $b + 2Ax_t - 1_n\|x_t\|^2$, and set $x_{t+1}$ to be the minimizer of

$$\sum_{t'\in[t]} b(i_{t'}) + 2A_{i_{t'}}x - \|x\|^2,$$

namely, as with MEB, $y_{t+1} \leftarrow \sum_{t'\in[t]} A_{i_{t'}}$ and $x_{t+1} \leftarrow y_{t+1}/t$. (We also make some sign changes to account for the max-min formulation here, versus the min-max formulation used for MEB above.)



This allows the use of Lemma A.4 for essentially the same analysis as for MEB; the gradient bound $G$ and Hessian bound $H$ are both at most 2, again assuming that all $A_i \in \mathbb{B}$.

MEB. When $b(i) \leftarrow -\|A_i\|^2$, we have

$$\max_{x\in\mathbb{R}^d}\min_i\; b(i) + 2A_i x - \|x\|^2 = -\min_{x\in\mathbb{R}^d}\max_i\; \|A_i\|^2 - 2A_i x + \|x\|^2 = -\min_{x\in\mathbb{R}^d}\max_i\; \|x - A_i\|^2,$$

which, up to sign, is the objective function for the MEB problem.



Margin Estimation. When $b \leftarrow 0$ in the primal problem (13), that problem is one of finding the shortest vector in the polytope $\{A^\top p \mid p \in \Delta\}$. Considering this case of the dual problem (18), for any given $x \in \mathbb{R}^d$ with $\min_i A_i x \le 0$, the value of $\beta \in \mathbb{R}$ such that $\beta x$ maximizes $\min_i 2A_i\beta x - \|\beta x\|^2$ is $\beta = 0$. On the other hand, if $x$ is such that $\min_i A_i x > 0$, the maximizing value is $\beta = \min_i A_i x/\|x\|^2$, so that the solution of (18) also maximizes $\min_i (A_i x)^2/\|x\|^2$. The latter is the square of the margin $\sigma$, which as before is the minimum distance of the points $A_i$ to the hyperplane that is normal to $x$ and passes through the origin.
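The scaling step can be made explicit with a short worked calculation. This is a sketch under the assumptions that $x$ is fixed, that $m$ (a name introduced here) abbreviates $\min_i A_i x$, and that only nonnegative scalings $\beta \ge 0$ are considered.

```latex
% Assumes x fixed, m = \min_i A_i x, and scaling restricted to \beta \ge 0.
\begin{align*}
g(\beta) &= \min_i \bigl(2 A_i (\beta x) - \|\beta x\|^2\bigr) = 2\beta m - \beta^2 \|x\|^2, \\
g'(\beta) &= 2m - 2\beta \|x\|^2 = 0 \;\Longrightarrow\; \beta^\ast = m / \|x\|^2 \quad (m > 0), \\
g(\beta^\ast) &= \frac{2m^2}{\|x\|^2} - \frac{m^2}{\|x\|^2} = \frac{m^2}{\|x\|^2} = \frac{(\min_i A_i x)^2}{\|x\|^2}.
\end{align*}
% If m \le 0, then g(\beta) = 2\beta m - \beta^2 \|x\|^2 \le 0 = g(0) for every \beta \ge 0.
```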

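Adapting Algorithm 2 for margin estimation, and with the slight changes needed for its analysis, we have that there is an algorithm taking $\tilde{O}(n/\epsilon^2 + d/\epsilon)$ time that finds $\bar{x} \in \mathbb{R}^d$ such that, for all $i \in [n]$,

$$2A_i\bar{x} - \|\bar{x}\|^2 \ge \sigma^2 - \epsilon.$$

When $\sigma^2 \le \epsilon$, we don't appear to gain any useful information. However, when $\sigma^2 > \epsilon$, we have $\min_{i\in[n]} A_i\bar{x} > 0$, and so, by appropriate scaling of $\bar{x}$, we have $\hat{x}$ such that

$$\hat{\sigma}^2 = \min_{i\in[n]} (A_i\hat{x})^2/\|\hat{x}\|^2 = \min_{i\in[n]} 2A_i\hat{x} - \|\hat{x}\|^2 \ge \sigma^2 - \epsilon,$$

and so $\hat{\sigma} \ge \sigma - \epsilon/\sigma$. That is, letting $\epsilon = \epsilon'\sigma$, if $\epsilon' \le \sigma$, there is an algorithm taking $\tilde{O}(n/(\epsilon'\sigma)^2 + d/(\epsilon'\sigma))$ time that finds a solution $\hat{x}$ with $\hat{\sigma} \ge \sigma - \epsilon'$.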



4 A Generic Sublinear Primal-Dual Algorithm 

We note that our technique above can be applied more broadly to any constrained optimization 
problem for which low-regret algorithms exist and low-variance sampling can be applied efficiently; 
that is, consider the general problem with optimum $\sigma$:

$$\max_{x\in\mathcal{K}}\,\min_{i}\; c_i(x) = \sigma. \tag{19}$$

Suppose that for the set $\mathcal{K}$ and cost functions $c_i(x)$, there exists an iterative low regret algorithm, denoted LRA, with regret $R(T) = o(T)$. Let $T_\epsilon(\mathrm{LRA})$ be the smallest $T$ such that $R(T)/T \le \epsilon$. We denote by $x_{t+1} \leftarrow \mathrm{LRA}(x_t, c)$ an invocation of this algorithm, when at state $x_t \in \mathcal{K}$ and the cost function $c$ is observed.

Algorithm 3 Generic Sublinear Primal-Dual Algorithm

1: Let $T \leftarrow \max\{T_\epsilon(\mathrm{LRA}), \frac{\log n}{\epsilon^2}\}$, $x_1 \leftarrow \mathrm{LRA}(\text{initial})$, $w_1 \leftarrow 1_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
2: for $t = 1$ to $T$ do
3:   for $i \in [n]$ do
4:     Let $\tilde{v}_t(i) \leftarrow \mathrm{Sample}(x_t, c_i)$
5:     $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
6:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
7:   end for
8:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
9:   $x_t \leftarrow \mathrm{LRA}(x_{t-1}, c_{i_t})$
10: end for
11: return $\bar{x} = \frac{1}{T}\sum_t x_t$
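The inner loop of Algorithm 3 (clip, then multiplicative update) can be sketched in a few lines. This is only an illustration: it assumes that clip(z, C) projects z onto the interval [-C, C], and the function names `clip`, `mw_step`, and `normalize` are introduced here for the example.

```python
def clip(value, bound):
    """Project value onto [-bound, bound] (assumed meaning of clip)."""
    return max(-bound, min(bound, value))

def mw_step(weights, estimates, eta):
    """One multiplicative-weights step:
    w(i) <- w(i) * (1 - eta*v(i) + eta^2 * v(i)^2),
    where v(i) is the estimate clipped to [-1/eta, 1/eta]."""
    new_weights = []
    for w, v_tilde in zip(weights, estimates):
        v = clip(v_tilde, 1.0 / eta)
        new_weights.append(w * (1.0 - eta * v + (eta * v) ** 2))
    return new_weights

def normalize(weights):
    """Turn nonnegative weights into a probability vector."""
    total = sum(weights)
    return [w / total for w in weights]

w = mw_step([1.0, 1.0, 1.0], [0.5, -0.5, 100.0], eta=0.1)
print(normalize(w))
```

Because the clipped value satisfies $|\eta v| \le 1$, each factor $1 - \eta v + \eta^2 v^2$ is at least $3/4$, so the weights stay positive.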



Let Sample($x$, $c$) be a procedure that returns an unbiased estimate of $c(x)$ with variance at most one, and that runs in constant time. Further assume $|c_i(x)| \le 1$ for all $x \in \mathcal{K}$, $i \in [n]$. Applying the techniques of Section 2 we can obtain the following generic lemma.

Lemma 4.1. The generic sublinear primal-dual algorithm returns a solution $\bar{x}$ that with probability at least 1/2 is an $\epsilon$-approximate solution in $\max\{T_\epsilon(\mathrm{LRA}), \frac{\log n}{\epsilon^2}\}$ iterations.

Proof. First we use the regret bounds for LRA to lower bound $\sum_{t\in[T]} c_{i_t}(x_t)$, next we get an upper bound for that quantity using the Weak Regret Lemma, and then we combine the two in expectation.

By definition, $c_i(x^*) \ge \sigma$ for all $i \in [n]$, and so, using the LRA regret guarantee,

$$T\sigma \le \max_{x\in\mathcal{K}} \sum_{t\in[T]} c_{i_t}(x) \le \sum_{t\in[T]} c_{i_t}(x_t) + R(T), \tag{20}$$

or rearranging,

$$\sum_{t\in[T]} c_{i_t}(x_t) \ge T\sigma - R(T). \tag{21}$$



Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the clipping of $v_t(i)$,

$$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]} \sum_{t\in[T]} v_t(i) + (\log n)/\eta + \eta \sum_{t\in[T]} p_t^\top v_t^2.$$

Using Lemma B.4 and Lemma B.5, since the procedure Sample is unbiased and has variance at most one, with high probability:

$$\forall i \in [n],\quad \sum_{t\in[T]} v_t(i) \le \sum_{t\in[T]} c_i(x_t) + O(\eta T),$$

$$\Bigl|\sum_{t\in[T]} c_{i_t}(x_t) - \sum_{t\in[T]} p_t^\top v_t\Bigr| = O(\eta T).$$

Plugging these two facts into the previous inequality, we have w.h.p.

$$\sum_{t\in[T]} c_{i_t}(x_t) \le \min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) + O\Bigl(\frac{\log n}{\eta} + \eta \sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Bigr). \tag{22}$$

Combining (21) and (22) we get w.h.p.

$$\min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) \ge T\sigma - R(T) - O\Bigl(\frac{\log n}{\eta} + \eta \sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Bigr).$$

And via Lemma B.8



we have w.p. at least \ that 



$$\min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) \ge T\sigma - O\Bigl(\frac{\log n}{\eta} + \eta T\Bigr) - R(T).$$

Dividing through by $T$, and using our choice of $\eta$, we have $\min_i c_i(\bar{x}) \ge \sigma - \epsilon/2$ w.p. at least 1/2 as claimed. □

High-probability results can be obtained using the same technique as for linear classification.

4.1 More applications

The generic algorithm above can be used to derive the result of Grigoriadis and Khachiyan [GK95] on sublinear approximation of zero sum games with payoffs/losses bounded by one (up to poly-logarithmic factors in running time). A zero sum game can be cast as the following min-max optimization problem:

$$\min_{x\in\mathcal{K}}\,\max_{i}\; A_i x.$$

That is, the constraints are inner products with the rows of the game matrix. This is exactly the same as the linear classification problem, but the vectors $x$ are taken from the convex set $\mathcal{K}$, which is the simplex, i.e., the set of all mixed strategies of the column player.

A low regret algorithm for the simplex is the multiplicative weights algorithm, which attains regret $R(T) \le 2\sqrt{T\log n}$. The procedure Sample($x$, $A_i$) to estimate the inner product $A_i x$ is much simpler than the one used for linear classification: we sample from the distribution $x$ and return $A_i(j)$ w.p. $x(j)$. This has correct expectation and variance bounded by one (in fact, the random variable is always bounded by one). Lemma 4.1 then implies:



Corollary 4.2. The sublinear primal-dual algorithm applied to zero sum games returns a solution 
x that with probability at least | is an e-approximate solution in O(-^r-) iterations and total time 
O(^). 
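The sampling procedure for zero sum games described above is simple enough to sketch directly. In this illustration `sample_payoff` and `expected_payoff` are names introduced here, and the mixed strategy and payoff row are made-up data.

```python
import random

def sample_payoff(x, row, rng):
    """Return row[j] for an index j drawn with probability x[j]."""
    j = rng.choices(range(len(x)), weights=x, k=1)[0]
    return row[j]

def expected_payoff(x, row):
    """Exact expectation of sample_payoff: sum_j x[j] * row[j] = <row, x>."""
    return sum(xj * rj for xj, rj in zip(x, row))

rng = random.Random(0)
x = [0.2, 0.3, 0.5]        # a mixed strategy (point of the simplex)
row = [1.0, -1.0, 0.5]     # one row A_i of the payoff matrix, entries in [-1, 1]
print(expected_payoff(x, row), sample_payoff(x, row, rng))
```

Since every returned value is an entry of the row, the estimate is bounded by one in magnitude whenever the payoffs are.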



Essentially any constrained optimization problem which has convex or linear constraints, and 
is over a simple convex body such as the ball or simplex, can be approximated in sublinear time 
using our method. The particular application to soft margin SVM, together with its practical 
significance, is explored in ongoing work with Nati Srebro.

5 A Semi-Streaming Implementation



In order to achieve space that is sublinear in $d$, we cannot afford to output a solution vector. We instead output both the cost of the solution, and a set of indices $i_1, \ldots, i_t$ for which the solution is a linear combination (that we know) of $A_{i_1}, \ldots, A_{i_t}$. We note that all previous algorithms for these problems, even to achieve this notion of output, required $\Omega(d)$ space and/or $\Omega(nd)$ time; see, e.g., the references in [AS10].

We discuss the modifications to the sublinear primal-dual algorithm that need to be done for 
classification and minimum enclosing ball problems. 

Our algorithm assumes it sees entire points at a time, i.e., it sees the entries of $A$ a row at a time, though the rows may be ordered arbitrarily. It relies on two streaming results about a $d$-dimensional vector $x$ undergoing updates to its coordinates. We assume that each update is of the form $(i, z)$, where $i \in [d]$ is a coordinate of $x$ and $z \in \{-P, -P+1, \ldots, P\}$ indicates that $x_i \leftarrow x_i + z$. The first is an efficient $\ell_2$-sketching algorithm of Thorup and Zhang. This algorithm allows for $(1+\epsilon)$-approximation of $\|x\|_2$ with high probability using 1 pass, $O(\epsilon^{-2})$ space, and time proportional to the length of the stream.

Theorem 5.1. ([TZ04]) There is a 1-pass algorithm which outputs a $(1\pm\epsilon)$-approximation to $\|x\|_2$ with probability $\ge 1 - \delta$ using $O(\epsilon^{-2}\log(PdQ)\log 1/\delta)$ bits of space and $O(Q\log 1/\delta)$ time, where $Q$ is the total number of updates in the stream.

The second component is due to Monemizadeh and Woodruff [MW10]. We are given a stream of updates to a $d$-dimensional vector $x$, and want to output a random coordinate $I \in [d]$ for which, for any $j \in [d]$, $\Pr[I = j] = \frac{|x_j|^2}{\|x\|_2^2}$. We also want the algorithm to return the value $x_I$. Such an algorithm is called an exact augmented $\ell_2$-Sampler. As shown in [MW10], an augmented $\ell_2$-Sampler with $O(\log d)$ space, $O(1)$ passes, and running time $O(Q)$ exists, where $Q$ is the number of updates in the stream. This is what we use to $\ell_2$-sample from an iterate vector that we can only afford to represent implicitly.

Theorem 5.2. (Theorem 1.3 of [MW10]) There is an $O(\log d)$-pass exact augmented $\ell_2$-Sampler that uses $O(\log^5(Pd))$ bits of space and has running time $Q\log^{O(1)}(PdQ)$, where $Q$ is the total number of updates in the stream. The algorithm fails with probability $\le d^{-c}$ for an arbitrarily large constant $c > 0$.

We maintain the indices $i_t$ and $j_t$ used in all $O(\epsilon^{-2})$ iterations of the primal-dual algorithm. Notice that in a single iteration $t$ the same $\ell_2$-sample index $j_t$ can be used for all $n$ rows. While we cannot afford to remember the probabilities in the dual vector, we can store the values $\frac{\alpha_t}{x_t(j_t)}$, where $\alpha_t$ is a $(1\pm\epsilon)$-approximation of $\|x_t\|^2$ which can be obtained using the Thorup-Zhang sketch. We also need such an approximation to $\|x_t\|$ to appropriately weight the rows used to do $\ell_2$-sampling (see below). Since we see rows (i.e., points) of $A$ one at a time, we can reconstruct the probability of each row in the dual vector on the fly in low space, and can use reservoir sampling to make the next choice of $i_t$. Then we use an augmented $\ell_2$-sampler to make the next choice of $j_t$, where we must $\ell_2$-sample from a weighted sum of rows indexed by $i_1, \ldots, i_t$ in low space. We use the fact, argued in §C, that the algorithm remains correct given the per-iteration rounding of the updates $v_t(i)$ to relative error $\mu$, where $\mu$ is on the order of $\eta\epsilon/T$. Throughout we round matrix entries to the nearest integer multiple of $\mathrm{poly}(1/d)$ for a sufficiently large polynomial.

We implicitly represent the primal and dual vectors. At iteration $t$ of the sublinear primal-dual algorithm, we have indices $i_1, \ldots, i_{t-1}$ of the sampled rows and indices $j_1, \ldots, j_t$ of the sampled columns for $\ell_2$-sampling (in a given iteration $t$, we use the same column $j_t$ for $\ell_2$-sampling from all rows). We maintain $\mu$-approximations to $\frac{\|x_1\|^2}{x_1(j_1)}, \ldots, \frac{\|x_{t-1}\|^2}{x_{t-1}(j_{t-1})}$. We compute $i_t$, $j_{t+1}$, and a $\mu$-approximation to $\frac{\|x_t\|^2}{x_t(j_t)}$.

We first determine $i_t$ in one pass. This can be done since $A$ is presented in row order, together with reservoir sampling. Namely, given row $A_k$, we compute for each $1 \le t' \le t-1$ a $\mu$-approximation $\tilde{v}_{t'}(k)$ to $v_{t'}(k) = A_k(j_{t'}) \cdot \frac{\|x_{t'}\|^2}{x_{t'}(j_{t'})}$, obtained by multiplying $A_k(j_{t'})$ by the stored approximation, and then

$$p_t(k) = \frac{1}{n}\prod_{t'=1}^{t-1}\bigl(1 - \eta\tilde{v}_{t'}(k) + \eta^2\tilde{v}_{t'}(k)^2\bigr).$$

Thus, we can reconstruct $p_t(k)$ for use with reservoir sampling to obtain a sample $i_t$.
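The one-pass row selection can be illustrated with a standard weighted reservoir sampler. This is a generic sketch rather than the paper's exact procedure: `reservoir_pick` is a name introduced here, and the (index, weight) stream stands in for the row weights $p_t(k)$ reconstructed as each row arrives.

```python
import random

def reservoir_pick(weight_stream, rng):
    """One pass over (index, weight) pairs; returns an index chosen with
    probability proportional to its weight, keeping only O(1) state."""
    chosen, total = None, 0.0
    for index, weight in weight_stream:
        if weight <= 0:
            continue
        total += weight
        # Replace the current choice with probability weight / total.
        if rng.random() * total < weight:
            chosen = index
    return chosen

rng = random.Random(7)
# Hypothetical unnormalized row weights, seen one row at a time.
print(reservoir_pick([(0, 1.0), (1, 3.0), (2, 0.5)], rng))
```

The weights need not be normalized, which is why storing only the products above, and never the full dual vector, suffices for this step.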

In the next $O(\log n)$ passes we obtain $j_{t+1}$ as follows. To $\ell_2$-sample from $x_t$, we use Theorem 5.2 to sample a coordinate from the length-$(t-1)d$ stream consisting of the entries of the concatenated list $L = A_{i_1}, A_{i_2}, \ldots, A_{i_{t-1}}$. Notice that $y_t = \frac{1}{\sqrt{2T}} \cdot \sum_{j=1}^{t-1} A_{i_j}$, and so Theorem 5.2 applied to $L$ implements $\ell_2$-sampling from $x_t$. However, the algorithm returns $y_t(j_t)$ rather than $x_t(j_t)$. To obtain an approximation to $x_t(j_t)$, we $(\epsilon/3)$-approximate $\|y_t\|$ using Theorem 5.1, from which $x_t(j_t) = \frac{y_t(j_t)}{\max\{1, \|y_t\|\}}$. We thus obtain an $(\epsilon/2)$-approximation to $\frac{\|x_t\|^2}{x_t(j_t)}$.



Using Lemma 



A.2 



letting vt+i = A*., then xt+i = — — rnr 1 — irr results in an 

additive $\epsilon$ approximation. To compute this, we must $(1\pm\epsilon)$-approximate $\|y_{t+1}\|$, which we do in an additional pass using Theorem 5.1. Note that we cannot afford $d$ space, which would be required to compute the norm exactly.

Theorem 5.3. There is an $O(\epsilon^{-2})$-pass, $O(\epsilon^{-2})$-space algorithm running in total time $O(\epsilon^{-4}(n + d))$ which returns a list of $T = O(\epsilon^{-2})$ row indices $i_1, \ldots, i_T$ which implicitly represent the normal vector to a hyperplane for $\epsilon$-approximate classification, together with an additive-$\epsilon$ approximation to the margin.

For the MEB problem, with high probability there are only $O(\epsilon^{-1})$ different values of $i_t$ (i.e., updates to the primal vector). An important point is that we can get all $O(\epsilon^{-1})$ $\ell_2$-samples independently from the same primal vector between changes to it by running the algorithm of [MW10] independently $O(\epsilon^{-1})$ times in parallel.

We spend $O((n + d)\epsilon^{-2})$ time per iteration, to reconstruct the dual vector and run the algorithm of [MW10] independently $O(\epsilon^{-1})$ times on a stream of length $O(d\epsilon^{-1})$ to do $\ell_2$-sampling.

Minimum Enclosing Ball. For the MEB problem we need the following standard tool.

Fact 5.4. (see, e.g., [?]) Let $\sigma \in \{-1, 1\}^d$ be uniform from a 4-wise independent family of sign vectors. For any $d$-dimensional vector $v$, $\mathbf{E}_\sigma[\langle\sigma, v\rangle^2] = \|v\|_2^2$, and $\mathrm{Var}_\sigma[\langle\sigma, v\rangle^2] \le 2\|v\|_2^4$.
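Fact 5.4 can be checked exhaustively for a small vector. The sketch below uses fully independent uniform signs, which are in particular 4-wise independent; the function name `sign_sketch_moments` and the test vector are assumptions made for the example.

```python
from itertools import product

def sign_sketch_moments(v):
    """Exact mean and variance of <sigma, v>^2 over all sigma in {-1, 1}^d."""
    values = []
    for sigma in product((-1, 1), repeat=len(v)):
        inner = sum(s * c for s, c in zip(sigma, v))
        values.append(inner * inner)
    mean = sum(values) / len(values)
    var = sum((z - mean) ** 2 for z in values) / len(values)
    return mean, var

v = [0.5, -0.25, 0.75, 0.1]          # illustrative vector
norm_sq = sum(c * c for c in v)
mean, var = sign_sketch_moments(v)
print(mean, norm_sq)                 # the two values should agree
print(var, 2 * norm_sq ** 2)         # variance versus the bound of Fact 5.4
```

The enumeration costs $2^d$ evaluations, so it is only meant as a sanity check for tiny $d$, not as a streaming implementation.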

Define an epoch to be a contiguous block of iterations for which $x_t$ does not change. Notice that $x_t$ does not change with probability $1 - \alpha$.

We describe the necessary modifications to Algorithm 2. Throughout we round matrix entries to the nearest integer multiple of $\mathrm{poly}(1/d)$ for a sufficiently large polynomial. We use the fact, argued in §C, that the algorithm remains correct given the per-iteration rounding of the updates $v_t(i)$ to relative error $\mu$, where $\mu$ is on the order of $\eta\epsilon/T$.

We will not compute $\|x_t\|$ in each epoch. This would require $\Omega(d)$ space. However, unlike in the case of classification, for the MEB problem we cannot even afford to use Theorem 5.1 to approximate $\|x_t\|^2$, as that would cost $\Omega(\epsilon^{-2})$ space. Instead, we will use Fact 5.4 to obtain an unbiased estimator of $\|x_t\|^2$, which suffices for our analysis to go through. Namely, by the triangle inequality, $\|x_t\| \le 1$ (since we divide by $t$), and so the estimator of Fact 5.4 has variance $O(1)$.

Again, we implicitly represent the primal and dual vectors. We only store one index $i_s$ and $j_s$ per epoch $s$. In epoch $s$, we have indices $i_1, \ldots, i_s$, which correspond to indices of the rows $A_{i_s}$ chosen for use to update the primal vector in the current and previous epochs. As in the non-streaming version of this algorithm, we use the same coordinate $j_s$ for $\ell_2$-sampling in all iterations in an epoch and for all rows. Hence, throughout the course of the algorithm, the expected number of indices $i_s$ and $j_s$ that the algorithm stores is the number $O(\alpha T) = O(\epsilon^{-1})$ of epochs. The algorithm also stores the number $m_s$ of iterations in each epoch in the same amount of space.

At the beginning of the s-th epoch, we have maintained /^/2-approximations , , . . . , ^ — — ^ 

to — fw, . . . , h — v We compute i s ,j s , and a u/2-approximation - }■ to — h-x. 

Xl{J\) X s — iyj 3 —l) • X S {J 3 ) x s{Js) 

We first determine i s in one pass. This can be done as in classification since A is presented in row 
order, together with reservoir sampling. Namely, given row A k , we compute for each 1 < s' < s — 1, 
a /^-approximation v s >(k) = A k (j s >) ■ j^j^ to v s >(k) = A k (j s >) ■ x y ) , and then 



1 8-1 

Ps(k) = - • TT (1 + r,v s >(k) + v 2 v 2 s ,(k)) r 
n J r ± 



Thus, we can reconstruct p s (k) for use with reservoir sampling to obtain a sample i s . 



In the next $O(\log n)$ passes we obtain $j_s$ as follows. To $\ell_2$-sample from $x_s$, we use Theorem 5.2 to sample a coordinate from the length-$(s-1)d$ stream consisting of the entries of the concatenated list $L = A_{i_1}, A_{i_2}, \ldots, A_{i_{s-1}}$. Notice that $y_s = \sum_{j=1}^{s-1} A_{i_j}$, and so Theorem 5.2 applied to $L$ implements



^2-sampling from y s , and hence x s as well. We thus obtain an (e/2)-approximation -/■■■> to 



Applying Fact 5.4, we obtain 



X 3 (j s ) X S (js)' 



Theorem 5.5. Given the norms of each row $A_i$, there is an $\tilde{O}(\epsilon^{-1})$-pass, $O(\epsilon^{-2})$-space algorithm running in total time $O(\epsilon^{-3}(n + d))$ which returns a list of $T = O(\epsilon^{-1})$ row indices $i_1, \ldots, i_T$ which implicitly represent the MEB center, together with an additive $\epsilon$-approximation to the MEB radius.



6 Kernelizing the Sublinear algorithms 

An important generalization of linear classifiers is that of kernel-based linear predictors (see e.g. [SS03]). Let $\Psi : \mathbb{R}^d \to \mathcal{H}$ be a mapping of feature vectors into a reproducing kernel Hilbert space. In this setting, we seek a non-linear classifier given by $h \in \mathcal{H}$ so as to maximize the margin:

$$\sigma = \max_{h\in\mathcal{H}}\,\min_{i\in[n]}\; \langle h, \Psi(A_i)\rangle.$$

The kernels of interest are those for which we can compute inner products of the form $k(x, y) = \langle\Psi(x), \Psi(y)\rangle$ efficiently.

One popular kernel is the polynomial kernel, for which the corresponding Hilbert space is the set of polynomials over $\mathbb{R}^d$ of degree $q$. The mapping $\Psi$ for this kernel is given by

$$\forall S \subseteq [d],\ |S| \le q:\quad \Psi(x)_S = \prod_{i\in S} x_i.$$

That is, all monomials of degree at most $q$. The kernel function in this case is given by $k(x, y) = (x^\top y)^q$. Another useful kernel is the Gaussian kernel $k(x, y) = \exp(-\frac{\|x - y\|^2}{2s^2})$, where $s$ is a parameter. The mapping here is defined by the kernel function (see [SS03] for more details).

Algorithm 4 Sublinear Kernel Perceptron

1: Input: $\epsilon > 0$, $A \in \mathbb{R}^{n\times d}$ with $A_i \in \mathbb{B}$ for $i \in [n]$.
2: Let $T \leftarrow 200^2\epsilon^{-2}\log n$, $y_1 \leftarrow 0$, $w_1 \leftarrow 1_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow w_t/\|w_t\|_1$, $x_t \leftarrow y_t/\max\{1, \|y_t\|\}$.
5:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
6:   $y_{t+1} \leftarrow \sum_{\tau\in[t]} \Psi(A_{i_\tau})/\sqrt{2T}$.
7:   for $i \in [n]$ do
8:     $\tilde{v}_t(i) \leftarrow$ Kernel-L2-Sampling($x_t$, $\Psi(A_i)$). (estimating $\langle x_t, \Psi(A_i)\rangle$)
9:     $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$.
10:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$.
11:   end for
12: end for
13: return $\bar{x} = \frac{1}{T}\sum_t x_t$



The kernel version of Algorithm 1 is shown in Algorithm 4. Note that $x_t$ and $y_t$ are members of $\mathcal{H}$, and not maintained explicitly, but rather are implicitly represented by the values $i_\tau$. (And thus $\|y_t\|$ is the norm of $\mathcal{H}$, not $\mathbb{R}^d$.) Also, $\Psi(A_i)$ is not computed. The needed kernel product $\langle x_t, \Psi(A_i)\rangle$ is estimated by the procedure Kernel-L2-Sampling, using the implicit representations and specific properties of the kernel being used. In the regular sublinear algorithm, this inner product could be sufficiently well approximated in $O(1)$ time via $\ell_2$-sampling. As we show below, for many interesting kernels the time for Kernel-L2-Sampling is not much longer.

For the analog of Theorem 2.7 to apply, we need the expectation of the estimates $v_t(i)$ to be correct, with variance $O(1)$. By Lemma C.1 it is enough if the estimates $v_t(i)$ have an additive bias of $O(\epsilon)$. Hence, we define the procedure Kernel-L2-Sampling to obtain such a not-too-biased estimator with variance at most one; first we show how to implement Kernel-L2-Sampling, assuming that there is an estimator $\tilde{k}()$ of the kernel $k()$ such that $\mathbf{E}[\tilde{k}(x, y)] = k(x, y)$ and $\mathrm{Var}(\tilde{k}(x, y)) \le 1$, and then we show how to implement such kernel estimators.



6.1 Implementing Kernel-L2-Sampling 

Estimating $\|y_t\|$. A key step in Kernel-L2-Sampling is the estimation of $\|y_t\|$, which readily reduces to estimating

$$Y_t \leftarrow 2T\|y_t\|^2/t^2 = \frac{1}{t^2}\sum_{\tau,\tau'\in[t]} k(A_{i_\tau}, A_{i_{\tau'}}),$$

that is, the mean of the summands. Since we use $\max\{1, \|y_t\|\}$, we need not be concerned with small $\|y_t\|$, and it is enough that the additive bias in our estimate of $Y_t$ be at most $\epsilon/T \le \epsilon(2T/t^2)$ for $t \in [T]$, implying a bias for $\|y_t\|$ no more than $\epsilon$. Since we need $1/\|y_t\|$ in the algorithm, it is not enough for estimates of $Y_t$ just to be good in mean and variance; we will find an estimator whose error bounds hold with high probability.

Our estimate $\tilde{Y}_t$ of $Y_t$ can first be considered assuming we only need to make an estimate for a single value of $t$.

Let $N_Y \leftarrow t^2\lceil(8/3)\log(1/\delta)T^2/\epsilon^2 t^2\rceil$. To estimate $Y_t$, we compute, for each $\tau, \tau' \in [t]$, $n_t \leftarrow N_Y/t^2$ independent estimates

$$X_{\tau,\tau',m} \leftarrow \mathrm{clip}(\tilde{k}(A_{i_\tau}, A_{i_{\tau'}}), T/\epsilon), \quad \text{for } m \in [n_t],$$

and our estimate is

$$\tilde{Y}_t \leftarrow \sum_{\substack{\tau,\tau'\in[t] \\ m\in[n_t]}} X_{\tau,\tau',m}/N_Y.$$

Lemma 6.1. With probability at least $1 - \delta$, $|Y_t - \tilde{Y}_t| \le \epsilon/T$.



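Proof. We apply Bernstein's inequality (as in (32)) to the $N_Y$ random variables $X_{\tau,\tau',m} - \mathbf{E}[X_{\tau,\tau',m}]$, which have mean zero, variance at most one, and are at most $T/\epsilon$ in magnitude. Bernstein's inequality implies, using $\mathrm{Var}[X_{\tau,\tau',m}] \le 1$,

$$\log \mathrm{Prob}\Bigl\{\sum_{\substack{\tau,\tau'\in[t] \\ m\in[n_t]}} (X_{\tau,\tau',m} - \mathbf{E}[X_{\tau,\tau',m}]) \ge \alpha\Bigr\} \le -\alpha^2/(N_Y + (T/\epsilon)\alpha/3),$$

and putting $\alpha \leftarrow N_Y\epsilon/T$ gives

$$\log \mathrm{Prob}\{\tilde{Y}_t - \mathbf{E}[\tilde{Y}_t] \ge \epsilon/T\} \le -N_Y^2(\epsilon/T)^2/(N_Y + (T/\epsilon)N_Y(\epsilon/T)/3) \le -(8/3)\log(1/\delta)(3/4) \le -2\log(1/\delta).$$

Similar reasoning for $-X_{\tau,\tau',m}$, and the union bound, implies the lemma. □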
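The arithmetic in this proof can be verified mechanically. The sketch below computes the batch sample count $N_Y$ and the exponent obtained at $\alpha = N_Y\epsilon/T$, using the form of the bound as it is written in the proof; the function names and the parameter values are assumptions chosen for illustration.

```python
import math

def batch_sample_count(T, eps, delta, t):
    """N_Y = t^2 * ceil((8/3) * log(1/delta) * T^2 / (eps^2 * t^2))."""
    per_pair = math.ceil((8.0 / 3.0) * math.log(1.0 / delta) * T * T / (eps * eps * t * t))
    return t * t * per_pair

def bernstein_exponent(N_Y, T, eps):
    """alpha^2 / (N_Y + (T/eps)*alpha/3) at alpha = N_Y*eps/T, as in the proof."""
    alpha = N_Y * eps / T
    return alpha * alpha / (N_Y + (T / eps) * alpha / 3.0)

T, eps, delta = 50, 0.1, 0.01        # illustrative parameters
for t in (1, 7, 50):
    N_Y = batch_sample_count(T, eps, delta, t)
    print(t, N_Y, bernstein_exponent(N_Y, T, eps), 2 * math.log(1 / delta))
```

At this choice of $\alpha$ the denominator equals $(4/3)N_Y$, so the exponent is $(3/4)N_Y(\epsilon/T)^2$, which the rounding up in $N_Y$ keeps at or above $2\log(1/\delta)$.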

To compute $\tilde{Y}_t$ for $t = 1 \ldots T$, we can save some work by reusing estimates from one $t$ to the next. Now let $N_Y \leftarrow \lceil(8/3)\log(1/\delta)T^2/\epsilon^2\rceil$. Compute $\tilde{Y}_1$ as above for $t = 1$, and let $\hat{Y}_1 \leftarrow \tilde{Y}_1$. For $t > 1$, let $n_t \leftarrow \lceil N_Y/t^2\rceil$, and let

$$\hat{Y}_t \leftarrow \sum_{m\in[n_t]} X_{t,t,m}/n_t + \sum_{\substack{\tau\in[t-1] \\ m\in[n_t]}} (X_{t,\tau,m} + X_{\tau,t,m})/n_t,$$

and return $\tilde{Y}_t \leftarrow \sum_{\tau\in[t]} \hat{Y}_\tau/t^2$.

Since for each $\tau$ and $\tau'$, the expected total contribution of all $X_{\tau,\tau',m}$ terms to $\tilde{Y}_t$ is $k(A_{i_\tau}, A_{i_{\tau'}})$, we have $\mathbf{E}[\tilde{Y}_t] = Y_t$. Moreover, the number of instances of $X_{\tau,\tau',m}$ averaged to compute $\tilde{Y}_t$ is always at least as large as the number used for the above "batch" version; it follows that the total variance of $\tilde{Y}_t$ is non-increasing in $t$, and therefore Lemma 6.1 holds also for the $\tilde{Y}_t$ computed stepwise.



Since the number of calls to k(, ) is X^e[T](l + ^ n t) = 0(N Y ), we have the following lemma. 

Lemma 6.2. The values Y(t 2 /2T) ~ \\yt\\, t £ [T], can be estimated with 0((log(l/ee))T 2 /e 2 ) calls 
to k(, ), so that with probability at least 1 — 8, \Yt(t 2 /2T) — \\yt\\ \ < e. The values \\yt\\, t 6 [T], can 
be computed exactly with T 2 calls to the exact kernel k(,). 

Proof. This follows from the discussion above, applying the union bound over $t \in [T]$, and adjusting constants. The claim for exact computation is straightforward. □

Given this procedure for estimating $\|y_t\|$, we can describe Kernel-L2-Sampling. Since $x_{t+1} = y_{t+1}/\max\{1, \|y_{t+1}\|\}$, we have

$$\langle x_{t+1}, \Psi(A_i)\rangle = \frac{1}{\max\{1, \|y_{t+1}\|\}\sqrt{2T}}\sum_{\tau\in[t]} \langle\Psi(A_{i_\tau}), \Psi(A_i)\rangle = \frac{1}{\max\{1, \|y_{t+1}\|\}\sqrt{2T}}\sum_{\tau\in[t]} k(A_{i_\tau}, A_i), \tag{23}$$

so that the main remaining step is to estimate $\sum_{\tau\in[t]} k(A_{i_\tau}, A_i)$, for $i \in [n]$. Here we simply call $\tilde{k}(A_{i_\tau}, A_i)$ for each $\tau$. We save time, at the cost of $O(n)$ space, by saving the value of the sum for each $i \in [n]$, and updating it for the next $t$ with $n$ calls $\tilde{k}(A_{i_t}, A_i)$.
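The bookkeeping described above, one running sum per row updated with $n$ kernel calls per iteration, can be sketched as follows. For simplicity this illustration uses an exact polynomial kernel in place of the estimator $\tilde{k}$; the class `KernelSums`, the helper `scaled_inner`, and the data are assumptions made for the example.

```python
import math

def poly_kernel(x, y, q):
    """Exact polynomial kernel k(x, y) = (x^T y)^q (stands in for the estimator)."""
    return sum(a * b for a, b in zip(x, y)) ** q

class KernelSums:
    """Keeps S(i) = sum over chosen rows tau of k(A[tau], A[i]) for every row i."""
    def __init__(self, A, kernel):
        self.A, self.kernel = A, kernel
        self.sums = [0.0] * len(A)          # O(n) extra space

    def add_row(self, i_t):
        """Account for a newly chosen row index i_t with n kernel calls."""
        for i, row in enumerate(self.A):
            self.sums[i] += self.kernel(self.A[i_t], row)

def scaled_inner(sum_i, y_norm, T):
    """Right-hand side of (23) for one row, given its running sum and ||y||."""
    return sum_i / (max(1.0, y_norm) * math.sqrt(2 * T))

A = [[0.6, 0.0], [0.0, 0.8], [0.3, 0.4]]    # illustrative rows in the unit ball
ks = KernelSums(A, lambda x, y: poly_kernel(x, y, 2))
for i_t in (0, 2, 2):                       # hypothetical sequence of chosen rows
    ks.add_row(i_t)
print([scaled_inner(s, y_norm=0.5, T=100) for s in ks.sums])
```

Each call to `add_row` costs $n$ kernel evaluations, matching the per-iteration cost stated in the text.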

Lemma 6.3. Let $L_k$ denote the expected time needed for one call to $\tilde{k}(,)$, and $T_k$ denote the time needed for one call to $k(,)$. Except for estimating $\|y_t\|$, Kernel-L2-Sampling can be computed in $nL_k$ expected time per iteration $t$. The resulting estimate has expectation within additive $\epsilon$ of $\langle x_t, \Psi(A_i)\rangle$, and variance at most one. Thus Algorithm 4 runs in time $\tilde{O}\bigl(\frac{L_k n + d}{\epsilon^2} + \min\{\frac{T_k}{\epsilon^4}, \frac{L_k}{\epsilon^6}\}\bigr)$, and produces a solution with properties as in Algorithm 1.



Proof. For Kernel-L2-Sampling it remains only to show that its variance is at most one, given that each $\tilde{k}(,)$ has variance at most one. We observe from (23) that $t$ independent estimates $\tilde{k}(,)$ are added together, and scaled by a value that is at most $1/\sqrt{2T}$. Since the variance of the sum is at most $t$, and the variance is scaled by a value no more than $1/2T$, the variance of Kernel-L2-Sampling is at most one. The only bias in the estimate is due to estimation of $\|y_t\|$, which gives relative error of $\epsilon$. For our kernels, $\|\Psi(v)\| \le 1$ if $v \in \mathbb{B}$, so the additive error of Kernel-L2-Sampling is $O(\epsilon)$.

The analysis of Algorithm 4 then follows as for the un-kernelized perceptron; we neglect the time needed for preprocessing for the calls to $\tilde{k}(,)$, as it is dominated by other terms for the kernels we consider, and this is likely in general. □



6.2 Implementing the Kernel Estimators 

Using the lemma above we can derive corollaries for the Gaussian and polynomial kernels. 



Polynomial kernels. For the polynomial kernel of degree $q$, estimating a single kernel product, i.e. $k(x, y) = k(A_i, A_j)$, where the norm of $x, y$ is at most one, takes $O(q)$ time as follows. Recall that for the polynomial kernel, $k(x, y) = (x^\top y)^q$. To estimate this kernel we take the product of $q$ independent $\ell_2$-samples, yielding $\tilde{k}(x, y)$. Notice that the expectation of this estimator is exactly equal to the product of expectations, $\mathbf{E}[\tilde{k}(x, y)] = (x^\top y)^q$. The variance of this estimator is equal to the product of variances, which is $\mathrm{Var}(\tilde{k}(x, y)) \le (\|x\|\|y\|)^q \le 1$. Of course, calculating the inner product exactly takes $O(d\log q)$ time. We obtain:

Corollary 6.4. For the polynomial degree-$q$ kernel, Algorithm 4 runs in time

$$\tilde{O}\Bigl(\frac{q(n + d)}{\epsilon^2} + \min\Bigl\{\frac{d\log q}{\epsilon^4}, \frac{q}{\epsilon^6}\Bigr\}\Bigr).$$
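The unbiasedness of the product-of-samples estimator can be checked exactly on a small example by enumerating all outcomes. The sketch assumes the usual $\ell_2$-sample of $x^\top y$: index $j$ is drawn with probability $x(j)^2/\|x\|^2$ and the value $y(j)\|x\|^2/x(j)$ is returned. The function names and vectors are introduced only for illustration.

```python
from itertools import product

def l2_outcomes(x, y):
    """All outcomes of one l2-sample of x^T y as (probability, value) pairs:
    index j has probability x[j]^2/||x||^2 and value y[j]*||x||^2/x[j]."""
    norm_sq = sum(c * c for c in x)
    return [(c * c / norm_sq, y[j] * norm_sq / c) for j, c in enumerate(x) if c != 0]

def product_estimator_mean(x, y, q):
    """Exact expectation of the product of q independent l2-samples."""
    outcomes = l2_outcomes(x, y)
    mean = 0.0
    for combo in product(outcomes, repeat=q):
        prob, value = 1.0, 1.0
        for p_j, v_j in combo:
            prob *= p_j
            value *= v_j
        mean += prob * value
    return mean

x = [0.6, -0.3, 0.2]                 # illustrative vectors of norm at most one
y = [0.1, 0.5, -0.7]
inner = sum(a * b for a, b in zip(x, y))
print(product_estimator_mean(x, y, 3), inner ** 3)
```

Drawing one such product at random costs $O(q)$ time once sampling from $x$ is available, whereas the enumeration here is only a verification device.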

Gaussian kernels. To estimate the Gaussian kernel function, we assume that $\|x\|$ and $\|y\|$ are known and no more than $s/2$; thus to estimate

$$k(x, y) = \exp(-\|x - y\|^2/2s^2) = \exp(-(\|x\|^2 + \|y\|^2)/2s^2)\exp(x^\top y/s^2),$$

we need to estimate $\exp(x^\top y/s^2)$. For $\exp(\gamma X) = \sum_{i\ge 0} \gamma^i X^i/i!$ with random $X$ and parameter $\gamma > 0$, we pick index $i$ with probability $\exp(-\gamma)\gamma^i/i!$ (that is, $i$ has a Poisson distribution) and return $\exp(\gamma)$ times the product of $i$ independent estimates of $X$.

In our case we take $X$ to be the average of $c$ $\ell_2$-samples of $x^\top y$, and hence $\mathbf{E}[X] = x^\top y$, $\mathbf{E}[X^2] \le \frac{1}{c}\mathbf{E}[(x^\top y)^2] \le \frac{1}{c}$. The expectation of our kernel estimator is thus:

$$\mathbf{E}[\tilde{k}(x, y)] = \mathbf{E}\Bigl[\sum_{i\ge 0} e^{-\gamma}\frac{\gamma^i}{i!}\cdot e^{\gamma}\cdot X^i\Bigr] = \sum_{i\ge 0} \frac{\gamma^i}{i!}\prod_{j=1}^{i} \mathbf{E}[X] = \exp(\gamma x^\top y).$$

The second moment of this estimator is bounded by:

E[k(x, yf] = E[J2 e-Vi! ■ ^ • (X 1 ) 2 ] = ^ fl E[A 2 ] < exp(^). 

i>0 i>0 j=l 

Hence, we take $\gamma = c = \frac{1}{s^2}$. This gives a correct estimator in terms of expectation and constant variance. The variance can be further made smaller than one by taking the average of a constant number of estimators of the above type.

As for evaluation time, the expected size of the index $i$ is $\gamma = \frac{1}{s^2}$. Thus, we require in expectation $\gamma \times c = \frac{1}{s^4}$ $\ell_2$-samples.
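The Poisson-index construction can be sketched as follows. The sampler uses Knuth's multiplication method for the Poisson draw, and `draw_x` is a hypothetical callback standing in for one unbiased estimate of $X$; these choices, the function names, and the parameter values are assumptions for the example.

```python
import math
import random

def poisson_draw(gamma, rng):
    """Draw from Poisson(gamma) by Knuth's multiplication method (small gamma)."""
    limit, count, prod = math.exp(-gamma), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def exp_estimate(gamma, draw_x, rng):
    """Unbiased estimate of exp(gamma * E[X]): exp(gamma) times a product of
    i independent draws of X, where i ~ Poisson(gamma)."""
    result = math.exp(gamma)
    for _ in range(poisson_draw(gamma, rng)):
        result *= draw_x(rng)
    return result

def series_mean(gamma, mean_x, terms=80):
    """Expectation of exp_estimate, summed as a series:
    sum_i e^{-gamma} gamma^i/i! * e^{gamma} * mean_x^i."""
    return sum(gamma ** i / math.factorial(i) * mean_x ** i for i in range(terms))

rng = random.Random(1)
gamma, mean_x = 0.5, 0.3             # illustrative values
print(series_mean(gamma, mean_x), math.exp(gamma * mean_x))
print(exp_estimate(gamma, lambda r: mean_x, rng))
```

The expected number of draws of $X$ per call equals $\gamma$, which is the source of the evaluation-time estimate in the text.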

We obtain: 

Corollary 6.5. For the Gaussian kernel with parameter $s$, Algorithm 4 runs in time

6.3 Kernelizing the MEB and strictly convex problems

Analogously to Algorithm 4 we can define the kernel version of strongly convex problems, including MEB. The kernelized version of MEB is particularly efficient, since as in Algorithm 2 the norm $\|y_t\|$ is never required. This means that the procedure Kernel-L2-Sampling can be computed in time $O(nL_k)$ per iteration, for a total running time of $\tilde{O}(L_k(\epsilon^{-2}n + \epsilon^{-1}d))$.

7 Lower bounds 

All of our lower bounds are information-theoretic, meaning that any successful algorithm must read 
at least some number of entries of the input matrix A. Clearly this also lower bounds the time 
complexity of the algorithm in the unit-cost RAM model. 

Some of our arguments use the following meta-theorem. Consider a $p \times q$ matrix $A$, where $p$ is an even integer. Consider the following random process. Let $W \ge q$. Let $a = 1 - 1/W$, and let $e_j$ denote the $j$-th standard $q$-dimensional unit vector. For each $i \in [p/2]$, choose a random $j \in [q]$ uniformly, and set $A_{i+p/2} \leftarrow A_i \leftarrow a e_j + b(1_q - e_j)$, where $b$ is chosen so that $\|A_i\|_2 = 1$. We say that such an $A$ is a YES instance. With probability 1/2, transform $A$ into a NO instance as follows: choose a random $i^* \in [p/2]$ uniformly, and if $A_{i^*} = a e_{j^*} + b(1_q - e_{j^*})$ for a particular $j^* \in [q]$, set $A_{i^*+p/2} \leftarrow -a e_{j^*} + b(1_q - e_{j^*})$.

Suppose there is a randomized algorithm reading at most $s$ positions of $A$ which distinguishes YES and NO instances with probability $\ge 2/3$, where the probability is over the algorithm's coin tosses and this distribution $\mu$ on YES and NO instances. By averaging, this implies a deterministic algorithm Alg reading at most $s$ positions of $A$ and distinguishing YES and NO instances with probability $\ge 2/3$, where the probability is taken only over $\mu$. We show the following meta-theorem with a standard argument.

Theorem 7.1. (Meta-theorem) For any such algorithm Alg, $s = \Omega(pq)$.

This Meta-Theorem follows from the following folklore fact:

Fact 7.2. Consider the following random process. Initialize a length-$r$ array $A$ to an array of $r$ zeros. With probability 1/2, choose a random position $i \in [r]$ and set $A[i] = 1$. With the remaining probability 1/2, leave $A$ as the all zero array. Then any algorithm which determines if $A$ is the all zero array with probability $\ge 2/3$ must read $\Omega(r)$ entries of $A$.

Let us prove Theorem 7.1 using this fact:

Proof. Consider the matrix $B \in \mathbb{R}^{(p/2)\times q}$ which is defined by subtracting the "bottom" half of the matrix from the top half, that is, $B_{i,j} = A_{i,j} - A_{i+p/2,j}$. Then $B$ is the all zeros matrix, except that with probability 1/2, there is one entry whose value is roughly two, and whose location is random and distributed uniformly. An algorithm distinguishing between YES and NO instances of $A$ in particular distinguishes between the two cases for $B$, which cannot be done without reading a linear number of entries. □



In the proofs of Theorem 7.3, Corollary 7.4, and Theorem 7.6 it will be more convenient to use $M$ as an upper bound on the number of non-zero entries of $A$ rather than the exact number of non-zero entries. However, it should be understood that these theorems (and corollary) hold even when $M$ is exactly the number of non-zero entries of $A$.

To see this, note that our random matrices $A$ constructed in the proofs have at most $M$ non-zero entries. If this number $M'$ is strictly less than $M$, we arbitrarily replace $M - M'$ zero entries with the value $(nd)^{-C}$ for a large enough constant $C > 0$. Under our assumptions on the margin or the minimum enclosing ball radius of the points, the solution value changes by at most a factor of $(1 \pm (nd)^{1-C})$, which does not affect the proofs.



7.1 Classification 

Recall that the margin $\sigma(A)$ of an $n \times d$ matrix $A$ is given by $\max_{x \in \mathbb{B}} \min_i A_i x$. Since we assume that $\|A_i\|_2 \le 1$ for all $i$, we have that $\sigma(A) \le 1$.



7.1.1 Relative Error 

We start with a theorem for relative error algorithms. 

Theorem 7.3. Let $\kappa > 0$ be a sufficiently small constant. Let $\epsilon$ and $\sigma(A)$ have $\sigma(A)^{-2}\epsilon^{-1} \le \kappa\min(n,d)$, $\sigma(A) \le 1 - \epsilon$, with $\epsilon$ also bounded above by a sufficiently small constant. Also assume that $M \ge 2(n+d)$, that $n \ge 2$, and that $d \ge 3$. Then any randomized algorithm which, with probability at least 2/3, outputs a number in the interval $[\sigma(A) - \epsilon\sigma(A), \sigma(A)]$ must read

$$\Omega(\min(M, \sigma(A)^{-2}\epsilon^{-1}(n+d)))$$

entries of $A$. This holds even if $\|A_i\|_2 = 1$ for all rows $A_i$.

Notice that this yields a stronger theorem than assuming that both n and d are sufficiently 
large, since one of these values may be constant. 

Proof. We divide the analysis into cases: the case in which $d$ or $n$ is constant, and the case in which each is sufficiently large. Let $\tau \in [0, 1-\epsilon]$ be a real number to be determined.

Case: $d$ or $n$ is a constant. By our assumption that $\sigma(A)^{-2}\epsilon^{-1} \le \kappa\min(n,d)$, the values $\sigma(A)$ and $\epsilon$ are constant, and sufficiently large. Therefore we just need to show an $\Omega(\min(M, n+d))$ bound on the number of entries read. By the premise of the theorem, $M = \Omega(n+d)$, so we can just show an $\Omega(n+d)$ bound.

An $\Omega(d)$ bound. We give a randomized construction of an $n \times d$ matrix $A$.

The first row of $A$ is built as follows. Let $A_{1,1} \leftarrow \tau$ and $A_{1,2} \leftarrow 0$. Pick $j^* \in \{3, 4, \ldots, d\}$ uniformly at random, and let $A_{1,j^*} \leftarrow \epsilon^{1/2}\tau$. For all remaining $j \in \{3, 4, \ldots, d\}$, assign $A_{1,j} \leftarrow \zeta$, where $\zeta \leftarrow 1/d^3$. (The role of $\zeta$ is to make an entry slightly non-zero to prevent an algorithm which has access to exactly the non-zero entries from skipping over it.) Now using the conditions on $\tau$, we have

$$X \leftarrow \|A_1\|^2 = \tau^2 + (d-3)\zeta^2 + \epsilon\tau^2 \le (1-\epsilon)^2 + d^{-2} + \epsilon \le 1 - \epsilon + \epsilon^2 + \kappa^2\epsilon^2 < 1,$$

and so by letting $A_{1,2} \leftarrow \sqrt{1 - X}$ we have $\|A_1\| = 1$.

Now we let $A_2 \leftarrow -A_1$, with two exceptions: we let $A_{2,1} \leftarrow A_{1,1} = \tau$, and with probability 1/2, we negate $A_{2,j^*}$. Thus $\|A_2\| = 1$ also.

For row $i$ with $i > 2$, put $A_{i,1} \leftarrow (1+\epsilon)\tau$, $A_{i,2} \leftarrow \sqrt{1 - A_{i,1}^2}$, and all remaining entries zero.

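We have the following picture.

$$A = \begin{pmatrix}
\tau & (1-\tau^2-(d-3)\zeta^2-\epsilon\tau^2)^{1/2} & \zeta & \cdots & \zeta & \epsilon^{1/2}\tau & \zeta & \cdots & \zeta \\
\tau & -(1-\tau^2-(d-3)\zeta^2-\epsilon\tau^2)^{1/2} & -\zeta & \cdots & -\zeta & \pm\epsilon^{1/2}\tau & -\zeta & \cdots & -\zeta \\
(1+\epsilon)\tau & (1-(1+\epsilon)^2\tau^2)^{1/2} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
(1+\epsilon)\tau & (1-(1+\epsilon)^2\tau^2)^{1/2} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}$$
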

Observe that the number of non-zero entries of the resulting matrix is $2n + 2d - 4$, which satisfies the premise of the theorem. Moreover, all rows $A_i$ satisfy $\|A_i\| = 1$.

Notice that if $A_{1,j^*} = -A_{2,j^*}$, then the margin of $A$ is at most $\tau$, which follows by observing that all but the first coordinate of $A_1$ and $A_2$ have opposite signs.

On the other hand, if $A_{1,j^*} = A_{2,j^*}$, consider the vector $y$ with $y_1 \leftarrow 1$, $y_{j^*} \leftarrow \sqrt{\epsilon}$, and all other entries zero. Then for all $i$, $A_i y = \tau(1+\epsilon)$, and so the unit vector $x \leftarrow y/\|y\|$ has

$$A_i x = \frac{\tau(1+\epsilon)}{(1+\epsilon)^{1/2}} = \tau(1+\epsilon)^{1/2} = \tau(1+\Omega(\epsilon)).$$

It follows that in this case the margin of $A$ is at least $\tau(1+\Omega(\epsilon))$. Setting $\tau = \Theta(\sigma)$ and rescaling $\epsilon$ by a constant factor, it follows that these two cases can be distinguished by an algorithm satisfying the premise of the theorem. By Fact 7.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(d)$ entries of $A$.
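The $\Omega(d)$ construction can be checked numerically on a small example. The sketch below is an illustration only: the function names, the 0-indexed columns, and the `same_sign` flag (standing in for the coin that decides whether the planted entry of row 2 is negated) are assumptions made here.

```python
import math

def build_rows(n, d, tau, eps, j_star, same_sign):
    """Rows of the Omega(d) instance (0-indexed columns, 2 <= j_star < d)."""
    zeta = 1.0 / d ** 3
    # Squared norm of row 1 before its second entry is filled in.
    x = tau ** 2 + (d - 3) * zeta ** 2 + eps * tau ** 2
    row1 = [zeta] * d
    row1[0] = tau
    row1[1] = math.sqrt(1.0 - x)
    row1[j_star] = math.sqrt(eps) * tau
    # Row 2 is -row1, except the first entry and (optionally) column j_star.
    row2 = [-v for v in row1]
    row2[0] = tau
    if same_sign:
        row2[j_star] = row1[j_star]
    big = (1.0 + eps) * tau
    rest = [[big, math.sqrt(1.0 - big * big)] + [0.0] * (d - 2)
            for _ in range(n - 2)]
    return [row1, row2] + rest

def witness_margin(rows, eps, j_star):
    """min_i A_i x for x = y / ||y|| with y_1 = 1 and y_{j*} = sqrt(eps)."""
    y = [0.0] * len(rows[0])
    y[0] = 1.0
    y[j_star] = math.sqrt(eps)
    norm = math.sqrt(1.0 + eps)
    return min(sum(a * b for a, b in zip(row, y)) for row in rows) / norm
```

When the two planted entries agree, the witness $x$ certifies a margin of $\tau(1+\epsilon)^{1/2}$. When they disagree, $A_1 + A_2 = (2\tau, 0, \ldots, 0)$, so $\min(A_1 x, A_2 x) \le \tau x_1 \le \tau$ for every unit vector $x$.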



An $\Omega(n)$ bound. We construct the $n \times d$ matrix $A$ as follows. All but the first two columns are 0. We set $A_{i,1} \leftarrow \tau$ and $A_{i,2} \leftarrow \sqrt{1-\tau^2}$ for all $i \in [n]$. Next, with probability 1/2, we pick a random row $i^*$, and negate $A_{i^*,2}$. We have the following picture.

$$A = \begin{pmatrix}
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\tau & \pm\sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0
\end{pmatrix}$$

The number of non-zeros of the resulting matrix is $2n \le M$. Depending on the sign of $A_{i^*,2}$, the margin of $A$ is either 1 or $\tau$. Setting $\tau = \Theta(\sigma)$, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 7.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(n)$ entries of $A$.

Case: $d$ and $n$ are both sufficiently large. Suppose first that $M = \Omega(\sigma(A)^{-2}\epsilon^{-1}(n+d))$ for a sufficiently large constant in the $\Omega(\cdot)$. Let $s$ be an even integer in $\Theta(\tau^{-2}\epsilon^{-1})$ and with $s < \min(n,d) - 1$. We will also choose a value $\tau$ in $\Theta(\sigma(A))$. We can assume without loss of generality that $n$ and $d$ are sufficiently large, and even.

An $\Omega(ns)$ bound. We set the $d$-th entry of each row of $A$ to the value $\tau$. We set all entries in columns $s+1$ through $d-1$ to 0. We then choose the remaining entries of $A$ as follows. We apply Theorem 7.1 with parameters $p = n$, $q = s$, and $W = d^2$, obtaining an $n \times s$ matrix $B$, where $\|B_i\| = 1$ for all rows $B_i$. Put $B' \leftarrow B\sqrt{1-\tau^2}$. We then set $A_{i,j} \leftarrow B'_{i,j}$ for all $i \in [n]$ and $j \in [s]$. We have the following block structure for $A$.

$$A = \begin{pmatrix} B\sqrt{1-\tau^2} & 0_{n \times (d-s-1)} & 1_n\tau \end{pmatrix}$$

Here $0_{n \times (d-s-1)}$ is a matrix of all 0's, of the given dimensions. Notice that $\|A_i\| = 1$ for all rows $A_i$, and the number of non-zero entries is at most $n(s+1)$, which is less than the value $M$.

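We claim that if $B$ is a YES instance, then the margin of $A$ is $\tau(1+\Omega(\epsilon))$. Indeed, consider the unit vector $x$ for which

$$x_j = \begin{cases} \left(\frac{\epsilon}{s} - \frac{\epsilon^2}{4s}\right)^{1/2} & j \in [s] \\ 0 & j \in [s+1, d-1] \\ 1 - \epsilon/2 & j = d \end{cases} \tag{24}$$

For any row $A_i$,
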

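$$A_i x \ge \sqrt{1-\tau^2}\left(1 - \frac{1}{d^2}\right)\left(\frac{\epsilon}{s} - \frac{\epsilon^2}{4s}\right)^{1/2} + \tau\left(1 - \frac{\epsilon}{2}\right) \ge \left(1 - O(\tau^2 + \epsilon + d^{-2})\right)\left(\frac{\epsilon}{s}\right)^{1/2} + \tau - \frac{\epsilon\tau}{2},$$

since every entry of $B$ is non-negative and each row of $B$ has one entry equal to $1 - 1/d^2$.

If we set $s = c\tau^{-2}\epsilon^{-1}$ for $c \in (0,4)$, then $(\epsilon/s)^{1/2} = \epsilon\tau/\sqrt{c}$ and

$$A_i x \ge \tau + \frac{\epsilon\tau}{\sqrt{c}}\left(1 - O(\tau^2 + \epsilon + d^{-2})\right) - \frac{\epsilon\tau}{2} = \tau(1+\Omega(\epsilon)). \tag{25}$$
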

On the other hand, if $B$ is a NO instance, we claim that the margin of $A$ is at most $\tau(1+O(\epsilon^2))$. By definition of a NO instance, there are rows $A_i$ and $A_j$ of $A$ which agree except on a single column $k$, for which $A_{i,k} = (1 - 1/d^2)\sqrt{1-\tau^2}$ while $A_{j,k} = -A_{i,k}$. It follows that the $x$ which maximizes $\min\{A_i x, A_j x\}$ has $x_k = 0$. But $\sum_{k' \ne k} A_{i,k'}^2 = 1 - (1-\tau^2) + O(1/d^2) = \tau^2 + O(1/d^2)$. Since $\|x\| \le 1$, by the Cauchy-Schwarz inequality

$$A_i x = A_j x \le \left(\tau^2 + O(1/d^2)\right)^{1/2} \le \tau + O(\epsilon^2\tau) = \tau(1+O(\epsilon^2)), \tag{26}$$

where the final inequality follows from our bound $\tau^{-2}\epsilon^{-1} = O(d)$.

Setting $\tau = \Theta(\sigma(A))$ and rescaling $\epsilon$ by a constant factor, an algorithm satisfying the premise of the theorem can distinguish the two cases, and so by Theorem 7.1, it must read $\Omega(ns) = \Omega(\sigma(A)^{-2}\epsilon^{-1}n)$ entries of $A$.



An $\Omega(ds)$ bound. We first define rows $s+1$ through $n$ of our $n \times d$ input matrix $A$. For $i > s$, put $A_{i,d} \leftarrow \tau(1+\epsilon)$, $A_{i,d-1} \leftarrow (1 - \tau^2(1+\epsilon)^2)^{1/2}$, and all remaining entries zero.

We now define rows 1 through $s$. Put $A_{i,d} \leftarrow \tau$ for all $i \in [s]$. Now we apply Theorem 7.1 with $p = s$, $q = d-2$, and $W = d^2$, obtaining an $s \times (d-2)$ matrix $B$, where $\|B_i\| = 1$ for all rows $B_i$. Put $B' \leftarrow B\sqrt{1-\tau^2}$, and set $A_{i,j} \leftarrow B'_{i,j}$ for all $i \in [s]$ and $j \in [d-2]$. We have the following block structure for $A$.

$$A = \begin{pmatrix} B\sqrt{1-\tau^2} & 0_s & 1_s\tau \\ 0_{(n-s) \times (d-2)} & 1_{n-s}(1 - \tau^2(1+\epsilon)^2)^{1/2} & 1_{n-s}\tau(1+\epsilon) \end{pmatrix}$$

Notice that $\|A_i\| = 1$ for all rows $A_i$, and the number of non-zero entries is at most $2n + sd < M$.



If $B$ is a YES instance, let $x$ be as in Equation (24). Since the first $s$ rows of $A$ agree with those in our proof of the $\Omega(ns)$ bound, then as shown in Equation (25), $A_i x \ge \tau(1+\Omega(\epsilon))$ for $i \in [s]$. Moreover, for $i > s$, since YES instances $B$ are entry-wise positive, we have

$$A_i x \ge \left(1 - \frac{\epsilon}{2}\right) \cdot \tau(1+\epsilon) = \tau(1+\Omega(\epsilon)).$$

Hence, if $B$ is a YES instance the margin is $\tau(1+\Omega(\epsilon))$.

Now suppose $B$ is a NO instance. Then, as shown in Equation (26), for any $x$ for which $\|x\| \le 1$, we have $A_i x \le \tau(1+O(\epsilon^2))$ for some $i \in [s]$. Hence, if $B$ is a NO instance, the margin is at most $\tau(1+O(\epsilon^2))$.

Setting $\tau = \Theta(\sigma(A))$ and rescaling $\epsilon$ by a constant factor, an algorithm satisfying the premise of the theorem can distinguish the two cases, and so by Theorem 7.1 it must read $\Omega(ds) = \Omega(\sigma(A)^{-2}\epsilon^{-1}d)$ entries of $A$.



Finally, if $M = O((n+d)\sigma(A)^{-2}\epsilon^{-1})$, then we must show an $\Omega(M)$ bound. We will use our previous construction for showing an $\Omega(ns)$ bound, but replace the value of $n$ there with $n'$, where $n'$ is the largest integer for which $n's \le M/2$. We claim that $n' \ge 1$. To see this, by the premise of the theorem $M \ge 2(n+d)$. Moreover, $s = \Theta(\epsilon^{-1})$ and $\epsilon^{-1} \le \kappa(n+d)$. For a small enough constant $\kappa > 0$, $s \le (n+d) \le M/2$, as needed.

As the theorem statement concerns matrices with $n$ rows, each of unit norm, we must have an input $A$ with $n$ rows. To achieve this, we put $A_{i,d} = \tau(1+\epsilon)$ and $A_{i,d-1} = (1 - \tau^2(1+\epsilon)^2)^{1/2}$ for all $i > n'$. In all remaining entries in rows $A_i$ with $i > n'$, we put the value 0. This ensures that $\|A_i\| = 1$ for all $i > n'$, and it is easy to verify that this does not change the margin of $A$. Hence, the lower bound is $\Omega(n's) = \Omega(M)$. Notice that the number of non-zero entries is at most $2n + n's \le 2M/3 + M/3 = M$, as needed.

This completes the proof. □ 



7.1.2 Additive Error 

Here we give a lower bound for the additive error case. We give two different bounds, one when $\epsilon < \sigma$, and one when $\epsilon \ge \sigma$. Notice that $\sigma \ge 0$ since we may take the solution $x = 0_d$. The following is a corollary of Theorem 7.3.

Corollary 7.4. Let $\kappa > 0$ be a sufficiently small constant. Let $\epsilon, \sigma(A)$ be such that $\sigma(A)^{-1}\epsilon^{-1} \le \kappa\min(n,d)$ and $\sigma(A) \le 1 - \epsilon/\sigma(A)$, where $0 < \epsilon \le \kappa'\sigma$ for a sufficiently small constant $\kappa' > 0$. Also assume that $M \ge 2(n+d)$, $n \ge 2$, and $d \ge 3$. Then any randomized algorithm which, with probability at least 2/3, outputs a number in the interval $[\sigma - \epsilon, \sigma]$ must read

$$\Omega(\min(M, \sigma^{-1}\epsilon^{-1}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.



Proof. We simply set the value of $\epsilon$ in Theorem 7.3 to $\epsilon/\sigma$. Notice that $\epsilon$ is at most a sufficiently small constant and the value $\sigma^{-2}\epsilon^{-1}$ in Theorem 7.3 equals $\sigma^{-1}\epsilon^{-1}$, which is at most $\kappa\min(n,d)$ by the premise of the corollary, as needed to apply Theorem 7.3. □

The following handles the case when $\epsilon = \Omega(\sigma)$.



Corollary 7.5. Let $\kappa > 0$ be a sufficiently small constant. Let $\epsilon, \sigma(A)$ be such that $\epsilon^{-2} \le \kappa\min(n,d)$, $\sigma(A) + \epsilon < \frac{1}{\sqrt{2}}$, and $\epsilon = \Omega(\sigma)$. Also assume that $M \ge 2(n+d)$, $n \ge 2$, and $d \ge 3$. Then any randomized algorithm which, with probability at least 2/3, outputs a number in the interval $[\sigma - \epsilon, \sigma]$ must read

$$\Omega(\min(M, \epsilon^{-2}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.



Proof. The proof is very similar to that of Theorem 7.3, so we just outline the differences. In the 
case that d or n is constant, we have the following families of hard instances: 



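An $\Omega(n)$ bound for constant $d$:

$$\begin{pmatrix}
\tau & (1-\tau^2-(d-3)\zeta^2-2(\epsilon+\tau)^2)^{1/2} & \zeta & \cdots & \zeta & \sqrt{2}(\epsilon+\tau) & \zeta & \cdots & \zeta \\
\tau & -(1-\tau^2-(d-3)\zeta^2-2(\epsilon+\tau)^2)^{1/2} & -\zeta & \cdots & -\zeta & \pm\sqrt{2}(\epsilon+\tau) & -\zeta & \cdots & -\zeta \\
\sqrt{2}(\epsilon+\tau) & (1-2(\epsilon+\tau)^2)^{1/2} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
\sqrt{2}(\epsilon+\tau) & (1-2(\epsilon+\tau)^2)^{1/2} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}$$

An $\Omega(d)$ bound for constant $n$:

$$\begin{pmatrix}
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\tau & \pm\sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\tau & \sqrt{1-\tau^2} & 0 & \cdots & 0
\end{pmatrix}$$
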



In these two cases, depending on the sign of the undetermined entry the margin is either $\tau$ or at least $\tau + \epsilon$ (in the $\Omega(d)$ bound, it is $\tau$ or 1, but we assume $\tau + \epsilon < 1/\sqrt{2}$). It follows that for $\tau = \sigma(A)$, the algorithm of the corollary can distinguish these two cases, for which the lower bounds follow from the proof of Theorem 7.3.



For the case of $n$ and $d$ sufficiently large, we have the following families of hard instances. In each case, the matrix $B$ is obtained by invoking Theorem 7.1 with the value of $s = \Theta(\epsilon^{-2})$.

An $\Omega(n\epsilon^{-2})$ bound for $n, d$ sufficiently large:

$$\begin{pmatrix} B\sqrt{1-\tau^2} & 0_{n \times (d-s-1)} & 1_n\tau \end{pmatrix}$$

An $\Omega(d\epsilon^{-2})$ bound for $n, d$ sufficiently large:

$$\begin{pmatrix} B\sqrt{1-\tau^2} & 0_s & 1_s\tau \\ 0_{(n-s) \times (d-2)} & 1_{n-s}(1-(\tau+\epsilon)^2)^{1/2} & 1_{n-s}(\tau+\epsilon) \end{pmatrix}$$

In these two cases, by setting $W = \mathrm{poly}(nd)$ to be sufficiently large in Theorem 7.1, depending on whether $B$ is a NO or a YES instance the margin is either at most $\tau + \frac{1}{\mathrm{poly}(nd)}$ or at least $\tau + \epsilon$ (for an appropriate choice of $s$). For $\tau < 1/\sqrt{2}$, the algorithm of the corollary can distinguish these two cases, and therefore needs $\Omega(ns)$ time in the first case, and $\Omega(ds)$ time in the second.

The extension of the proofs to handle the case $M = o((n+d)\epsilon^{-2})$ is identical to that given in the proof of Theorem 7.3. □



7.2 Minimum Enclosing Ball 

We start by proving the following lower bound for estimating the squared MEB radius to within an additive $\epsilon$. In the next subsection we improve the $\Omega(\epsilon^{-1}n)$ term in the lower bound to $\Omega(\epsilon^{-2}n)$ for algorithms that either additionally output a coreset, or output a MEB center that is a convex combination of the input points. Our primal-dual algorithm has both of these properties, although either one alone would be enough for those bounds to apply to it. Together with the $\epsilon^{-1}d$ bound given by the next theorem, these bounds establish its optimality.

Theorem 7.6. Let $\kappa > 0$ be a sufficiently small constant. Assume $\epsilon^{-1} \le \kappa\min(n,d)$ and $\epsilon$ is less than a sufficiently small constant. Also assume that $M \ge 2(n+d)$ and that $n \ge 2$. Then any randomized algorithm which, with probability at least 2/3, outputs a number in the interval

$$\left[\min_x \max_i \|x - A_i\|^2 - \epsilon,\ \min_x \max_i \|x - A_i\|^2\right]$$

must read

$$\Omega(\min(M, \epsilon^{-1}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.

Proof. As with classification, we divide the analysis into cases: the case in which d or n is constant, 
and the case in which each is sufficiently large. 



Case: $d$ or $n$ is a constant. By our assumption that $\epsilon^{-1} \le \kappa\min(n,d)$, $\epsilon$ is a constant, and sufficiently large. So we just need to show an $\Omega(\min(M, n+d))$ bound. By the premise of the theorem, $M \ge 2(n+d)$, so we need only show an $\Omega(n+d)$ bound.

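An $\Omega(d)$ bound. We construct an $n \times d$ matrix $A$ as follows. For $i > 2$, each row $A_i$ is just the vector $e_1 = (1, 0, 0, \ldots, 0)$.

Let $A_{1,1} \leftarrow 0$, and initially assign $\zeta \leftarrow 1/d$ to all remaining entries of $A_1$. Choose a random integer $j^* \in [2,d]$, and assign $A_{1,j^*} \leftarrow \sqrt{1-(d-2)\zeta^2}$. Note that $\|A_1\| = 1$.

Let $A_2 \leftarrow -A_1$, and then with probability 1/2, negate $A_{2,j^*}$.

Our matrix $A$ is as follows.

$$A = \begin{pmatrix}
0 & \zeta & \cdots & \zeta & \sqrt{1-(d-2)\zeta^2} & \zeta & \cdots & \zeta \\
0 & -\zeta & \cdots & -\zeta & \pm\sqrt{1-(d-2)\zeta^2} & -\zeta & \cdots & -\zeta \\
1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}$$

Observe that $A$ has at most $2n + 2d \le M$ non-zero entries, and all rows satisfy $\|A_i\| = 1$.
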
If $A_{1,j^*} = -A_{2,j^*}$, then $A_1$ and $A_2$ form a diametral pair, and the MEB radius is 1.

On the other hand, if $A_{1,j^*} = A_{2,j^*}$, then consider the ball center $x$ with $x_1 = x_{j^*} = 1/\sqrt{2}$, and all other entries zero. Then for all $i > 2$,

$$\|x - A_i\|^2 = \left(1 - \frac{1}{\sqrt{2}}\right)^2 + \frac{1}{2} = 2 - \sqrt{2}.$$

On the other hand, for $i \in \{1,2\}$, we have

$$\|x - A_i\|^2 \le \frac{1}{2} + (d-2)\zeta^2 + \left(1 - \frac{1}{\sqrt{2}}\right)^2 \le 2 - \sqrt{2} + \frac{1}{d}.$$

It follows that for $\epsilon$ satisfying the premise of the theorem, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 7.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(d)$ entries of $A$.



An $\Omega(n)$ bound. We construct the $n \times d$ matrix $A$ as follows. Initially set all rows $A_i \leftarrow e_1 = (1, 0, 0, \ldots, 0)$. Then with probability 1/2 choose a random $i^* \in [n]$, and negate $A_{i^*,1}$. We have the following picture.

$$A = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
\pm 1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0
\end{pmatrix}$$

The number of non-zeros of the resulting matrix is $n < M$. In the case where there is an entry of $-1$, the MEB radius of $A$ is 1, but otherwise the MEB radius is 0. Hence, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 7.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(n)$ entries of $A$.



Case: $d$ and $n$ are sufficiently large. Suppose first that $M = \Omega(\epsilon^{-1}(n+d))$ for a sufficiently large constant in the $\Omega(\cdot)$. Put $s = \Theta(\epsilon^{-1})$. We can assume without loss of generality that $n$, $d$, and $s$ are sufficiently large integers. We need the following simple claim.

Claim 7.7. Given an instance of the minimum enclosing ball problem in $T \ge t$ dimensions on a matrix with rows $\{\alpha e_i + \beta\sum_{j \in [t] \setminus \{i\}} e_j\}_{i=1}^{t}$ for distinct standard unit vectors $e_i$ and $\alpha \ge \beta \ge 0$, the solution $x = \sum_{i=1}^{t} ((\alpha + (t-1)\beta)/t)\, e_i$ of cost $(\alpha-\beta)^2(1 - 1/t)$ is optimal.

Proof. We can subtract the point $\beta 1_t$ from each of the points, and an optimal solution $y$ for the translated problem yields an optimal solution $y + \beta 1_t$ for the original problem with the same cost. We can assume without loss of generality that $T = t$ and that $e_1, \ldots, e_t$ are the $t$ standard unit vectors in $\mathbb{R}^t$. Indeed, the value of each of the rows on each of the remaining coordinates is 0. The cost of the point $y^* = \sum_{i=1}^{t} (\alpha-\beta)e_i/t$ in the translated problem is

$$(\alpha-\beta)^2\left(1 - \frac{1}{t}\right)^2 + (t-1)(\alpha-\beta)^2/t^2 = (\alpha-\beta)^2\left(1 - \frac{1}{t}\right).$$

On the other hand, for any point $y$, the cost with respect to row $i$ is $(\alpha - \beta - y_i)^2 + \sum_{j \ne i} y_j^2$. By averaging and Cauchy-Schwarz, there is a row of cost at least

$$\frac{1}{t}\left[\sum_{i=1}^{t} (\alpha - \beta - y_i)^2 + (t-1)\sum_{i=1}^{t} y_i^2\right] = \|y\|^2 + (\alpha-\beta)^2 - \frac{2(\alpha-\beta)}{t}\sum_{i=1}^{t} y_i \ge \|y\|^2 + (\alpha-\beta)^2 - \frac{2(\alpha-\beta)\|y\|}{\sqrt{t}}.$$

Taking the derivative with respect to $\|y\|$, this is minimized when $\|y\| = (\alpha-\beta)/\sqrt{t}$, for which the cost is at least $(\alpha-\beta)^2(1 - 1/t)$. □

An $\Omega(ns)$ bound. We set the first $s$ rows of $A$ to $e_1, \ldots, e_s$. We set all entries outside of the first $s$ columns of $A$ to 0. We choose the remaining $n - s = \Omega(n)$ rows of $A$ by applying Theorem 7.1 with parameters $p = n - s$, $q = s$, and $W = d$. If $A$ is a YES instance, then by Claim 7.7 there is a solution with cost $(a - b)^2(1 - 1/s) = 1 - \Theta(1/s)$. On the other hand, if $A$ is a NO instance then for a given $x$, either $\|A_{j^*} - x\|^2$ or $\|A_{p/2+j^*} - x\|^2$ is at least $a^2 = 1 - \Theta(1/d)$. By setting $s = \Theta(\epsilon^{-1})$ appropriately, these two cases differ by an additive $\epsilon$, as needed.
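Claim 7.7, which drives the YES case above, can be sanity-checked numerically. The following is a small sketch with illustrative names; the random-perturbation test is only a spot check that no nearby center does better, not a proof of optimality.

```python
import random

def max_sq_dist(center, rows):
    """Largest squared distance from the center to any row."""
    return max(sum((c - v) ** 2 for c, v in zip(center, row)) for row in rows)

def claim_instance(t, alpha, beta):
    """Rows alpha*e_i + beta*sum_{j != i} e_j and the center from Claim 7.7."""
    rows = [[alpha if j == i else beta for j in range(t)] for i in range(t)]
    center = [(alpha + (t - 1) * beta) / t] * t
    return rows, center

def worst_perturbed_cost(center, rows, trials, radius, seed):
    """Smallest cost found among randomly perturbed centers."""
    rng = random.Random(seed)
    return min(
        max_sq_dist([c + rng.uniform(-radius, radius) for c in center], rows)
        for _ in range(trials)
    )
```

With $t = 6$, $\alpha = 0.9$, $\beta = 0.1$, the claimed center has cost $(\alpha-\beta)^2(1-1/t) = 0.64 \cdot 5/6$, and the perturbed centers are never cheaper.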



An $\Omega(ds)$ bound. We choose $A$ by applying Theorem 7.1 with parameters $p = s$, $q = d$, and $W = d$. If $A$ is a YES instance, then by Claim 7.7, there is a solution of cost at most $(a - b)^2(1 - 1/s) = 1 - \Theta(1/s)$. On the other hand, if $A$ is a NO instance, then for a given $x$, either $\|A_{j^*} - x\|^2$ or $\|A_{p/2+j^*} - x\|^2$ is at least $a^2 = 1 - \Theta(1/d)$. As before, setting $s = \Theta(\epsilon^{-1})$ appropriately causes these cases to differ by an additive $\epsilon$.



Finally, it remains to show an $\Omega(M)$ bound in case $M = O(\epsilon^{-1}(n+d))$. We will use our previous construction for showing an $\Omega(ns)$ bound, but replace the value of $n$ there with $n'$, where $n'$ is the largest integer for which $n's \le M/2$. We claim that $n' \ge 1$. To see this, by the premise of the theorem $M \ge 2(n+d)$. Moreover, $s = \Theta(\epsilon^{-1})$ and $\epsilon^{-1} \le \kappa(n+d)$. For a small enough constant $\kappa > 0$, $s \le (n+d) \le M/2$, as needed.

As the theorem statement concerns matrices with $n$ rows, each of unit norm, we must have an input $A$ with $n$ rows. In this case, since the first row of $A$ is $e_1$, which has sparsity 1, we can simply set all remaining rows to the value of $e_1$, without changing the MEB solution. Hence, the lower bound is $\Omega(n's) = \Omega(M)$. Notice that the number of non-zero entries is at most $n + n's \le M/2 + M/2 = M$, as needed.

This completes the proof. □ 



7.3 An $\Omega(n\epsilon^{-2})$ Bound for Minimum Enclosing Ball

7.3.1 Intuition

Before diving into the intricate lower bound of this section, we describe a simple construction which lies at its core. Consider two distributions over arrays of size $d$: the first distribution, $\mu$, is uniformly distributed over all strings with exactly $\frac{3d}{4}$ entries that are 1, and $\frac{d}{4}$ entries that are $-1$. The second distribution, $\sigma$, is uniformly distributed over all strings with exactly $\frac{3d}{4} - D$ entries that are 1, and $\frac{d}{4} + D$ entries that are $-1$, for $D = \Theta(\sqrt{d})$.

Let $x \sim \mu$ with probability $\frac{1}{2}$ and $x \sim \sigma$ with probability $\frac{1}{2}$. Consider the task of deciding from which distribution $x$ was sampled. In both cases, the distributions are over the sphere of radius $\sqrt{d}$, so the norm itself cannot be used to distinguish them. At the heart of our construction lies the following fact:

Fact 7.8. Any algorithm that decides with probability $\ge \frac{3}{4}$ the distribution that $x$ was sampled from, must read at least $\Omega(d)$ entries from $x$.

We prove a version of this fact in the next sections. But first, let us explain the use of this fact in the lower bound construction: We create an instance of MEB which contains either $n$ vectors similar to the first type, or alternatively $n - 1$ vectors of the first type and an extra vector of the second type (with a small bias). To distinguish between the two types of instances, an algorithm has no choice but to check all $n$ vectors, and for each invest $\Omega(d)$ work as per the above fact. In our parameter setting, we'll choose $d = \Theta(\epsilon^{-2})$, attaining the lower bound of $\Omega(nd) = \Omega(n\epsilon^{-2})$ in terms of time complexity.

To compute the difference in MEB center as $n \to \infty$, note that by symmetry in the first case the center will be of the form $(a, a, \ldots, a)$, where the value $a \in \mathbb{R}$ is chosen to minimize the maximal distance:

$$\arg\min_a \left\{\frac{3}{4}(1-a)^2 + \frac{1}{4}(-1-a)^2\right\} = \arg\min_a \{a^2 - a + 1\} = \frac{1}{2}.$$

The second MEB center will be

$$\arg\min_a \left\{\left(\frac{3}{4} - \frac{D}{d}\right)(1-a)^2 + \left(\frac{1}{4} + \frac{D}{d}\right)(-1-a)^2\right\} = \arg\min_a \left\{a^2 - \left(1 - \frac{4D}{d}\right)a + 1\right\} = \frac{1}{2} - \frac{2D}{d}.$$

Hence, the difference in MEB centers is on the order of $\sqrt{d \times (D/d)^2} = \Theta(D/\sqrt{d}) = \Theta(1)$. However, the whole construction is scaled to fit in the unit ball, and hence the difference in MEB centers becomes $\frac{1}{\sqrt{d}} \sim \epsilon$. Hence for an $\epsilon$ approximation the algorithm must distinguish between the two distributions, which in turn requires $\Omega(\epsilon^{-2})$ work.
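The two minimizations in the intuition above are quadratics in $a$, so they can be checked exactly. A small sketch follows; the function names and the use of exact fractions are choices made here for illustration. With a fraction $f$ of $+1$ entries, the objective expands to $a^2 - 2(2f-1)a + 1$, which is minimized at $a = 2f - 1$.

```python
from fractions import Fraction

def objective(a, frac_plus):
    """Quadratic objective for a fraction frac_plus of +1 entries."""
    return frac_plus * (1 - a) ** 2 + (1 - frac_plus) * (-1 - a) ** 2

def best_coordinate(frac_plus):
    """Minimizer of a^2 - 2(2 f - 1) a + 1, namely a = 2 f - 1."""
    return 2 * frac_plus - 1

def center_shift(d, D):
    """Euclidean distance between the two centers (a, ..., a) before rescaling."""
    return 2 * D / d ** 0.5
```

For example, with $d = 400$ and $D = 20 = \sqrt{d}$, the common coordinate moves from $1/2$ to $1/2 - 1/10$ and the centers are at distance $2D/\sqrt{d} = 2$ before rescaling.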



7.3.2 Probabilistic Lemmas 

For a set $S$ of points in $\mathbb{R}^d$, let MEB($S$) denote the smallest ball that contains $S$. Let Radius($S$) be the radius of MEB($S$), and Center($S$) the unique center of MEB($S$).

For our next lower bound, our bad instance will come from points on the hypercube $\mathcal{H}_d = \left\{-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}\right\}^d$.

Call a vertex of $\mathcal{H}_d$ regular if it has $\frac{3d}{4}$ coordinates equal to $\frac{1}{\sqrt{d}}$ and $\frac{d}{4}$ coordinates equal to $-\frac{1}{\sqrt{d}}$. Call a vertex special if it has $\frac{3d}{4} - 12dD$ coordinates equal to $\frac{1}{\sqrt{d}}$ and $\frac{d}{4} + 12dD$ coordinates equal to $-\frac{1}{\sqrt{d}}$, where $D = \frac{\ln n}{\sqrt{d}}$.

We will consider instances where all but one of the input rows $A_i$ are random regular points, and one row may or may not be a random special point. We will need some lemmas about these points.

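Lemma 7.9. Let $a$ denote a random regular point, $b$ a special point, and $c$ denote the point $1_d/(2\sqrt{d}) = \left(\frac{1}{2\sqrt{d}}, \frac{1}{2\sqrt{d}}, \ldots, \frac{1}{2\sqrt{d}}\right)$. Then

$$\|a\| = \|b\| = 1 \tag{27}$$
$$\|c\|^2 = a^\top c = \frac{1}{4} \tag{28}$$
$$\|a - c\| = \frac{\sqrt{3}}{2} \tag{29}$$
$$b^\top c = E[a^\top b] = \frac{1}{4} - 12D \tag{30}$$

Proof. The norm claims are entirely straightforward, and we have

$$a^\top c = \frac{1}{2d} \cdot \frac{3d}{4} - \frac{1}{2d} \cdot \frac{d}{4} = \frac{1}{4}.$$

Also (29) follows by

$$\|a - c\|^2 = \|a\|^2 + \|c\|^2 - 2a^\top c = 1 + \frac{1}{4} - 2 \cdot \frac{1}{4} = \frac{3}{4}.$$

For (30), we have

$$b^\top c = \frac{1}{2d} \cdot d\left(\frac{3}{4} - 12D\right) - \frac{1}{2d} \cdot d\left(\frac{1}{4} + 12D\right) = \frac{3}{8} - 6D - \frac{1}{8} - 6D = \frac{1}{4} - 12D,$$

and by linearity of expectation,

$$E[a^\top b] = d \cdot \frac{1}{d}\left[\left(\frac{3}{4} - 12D\right)\left(\frac{3}{4} - \frac{1}{4}\right) - \left(\frac{1}{4} + 12D\right)\left(\frac{3}{4} - \frac{1}{4}\right)\right] = \frac{1}{2}\left(\frac{1}{2} - 24D\right) = \frac{1}{4} - 12D.$$

□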
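The identities (27) through (30) are easy to confirm on explicit vectors. The sketch below uses one fixed arrangement of signs rather than a random one, since the norm and inner-product identities with $c$ do not depend on the arrangement. The helper names and the parameter `shift` (the number $12dD$ of coordinates moved from $+\frac{1}{\sqrt{d}}$ to $-\frac{1}{\sqrt{d}}$) are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def regular_point(d):
    """3d/4 coordinates equal to 1/sqrt(d), the rest equal to -1/sqrt(d)."""
    s = 1.0 / math.sqrt(d)
    return [s] * (3 * d // 4) + [-s] * (d // 4)

def special_point(d, shift):
    """Like a regular point, with `shift` = 12 d D extra negative coordinates."""
    s = 1.0 / math.sqrt(d)
    return [s] * (3 * d // 4 - shift) + [-s] * (d // 4 + shift)

def center_point(d):
    """The point c = 1_d / (2 sqrt(d))."""
    return [1.0 / (2.0 * math.sqrt(d))] * d
```

Since each coordinate of a random regular point has mean $\frac{1}{2\sqrt{d}}$, the expectation $E[a^\top b]$ equals $b^\top c$, which the last identity in the lemma records.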



Next, we show that $a^\top b$ is concentrated around its expectation (30).

Lemma 7.10. Let $a$ be a random regular point, and $b$ a special point. For $d \ge 8\ln^2 n$, $\Pr[a^\top b > \frac{1}{4} - 6D] \le \frac{1}{n^3}$ and $\Pr[a^\top b < \frac{1}{4} - 18D] \le \frac{1}{n^3}$.

Proof. We will prove the first tail estimate, and then discuss the changes needed to prove the second estimate.

We apply the upper tail of the following enhanced form of Hoeffding's bound, which holds for random variables with bounded correlation.

Fact 7.11. (Theorem 3.4 of [PS91] with their value of $\lambda$ equal to 1) Let $X_1, \ldots, X_d$ be given random variables with support $\{0,1\}$ and let $X = \sum_{j=1}^{d} X_j$. Let $\gamma > 0$ be arbitrary. If there exist independent random variables $\hat{X}_1, \ldots, \hat{X}_d$ with $\hat{X} = \sum_{j=1}^{d} \hat{X}_j$ and $E[X] \le E[\hat{X}]$ such that for all $J \subseteq [d]$,

$$\Pr\left[\wedge_{j \in J} X_j = 1\right] \le \prod_{j \in J} \Pr\left[\hat{X}_j = 1\right],$$

then

$$\Pr[X > (1+\gamma)E[\hat{X}]] \le \left[\frac{e^\gamma}{(1+\gamma)^{1+\gamma}}\right]^{E[\hat{X}]}.$$

Define $X_j = \frac{d}{2}\left(a_j b_j + \frac{1}{d}\right)$. Since $a_j b_j \in \{-\frac{1}{d}, \frac{1}{d}\}$, the $X_j$ have support $\{0,1\}$. Let $\hat{X}_1, \ldots, \hat{X}_d$ be i.i.d. variables with support $\{0,1\}$ with $E[\hat{X}_j] = E[X_j]$ for all $j$.

We claim that for all $J \subseteq [d]$, $\Pr[\wedge_{j \in J} X_j = 1] \le \prod_{j \in J} \Pr[\hat{X}_j = 1]$. By symmetry, it suffices to prove it for $J \in \{[1], [2], \ldots, [d]\}$. We prove it by induction. The base case $J = [1]$ follows since $E[\hat{X}_j] = E[X_j]$. To prove the inequality for $J = [\ell]$, $\ell \ge 2$, assume the inequality holds for $[\ell - 1]$. Then,

$$\Pr[\wedge_{j \in [\ell]} X_j = 1] = \Pr[\wedge_{j \in [\ell-1]} X_j = 1] \cdot \Pr[X_\ell = 1 \mid \wedge_{j \in [\ell-1]} X_j = 1],$$

and by the inductive hypothesis,

$$\Pr[\wedge_{j \in [\ell-1]} X_j = 1] \le \prod_{j \in [\ell-1]} \Pr[\hat{X}_j = 1],$$

so to complete the induction it is enough to show

$$\Pr[X_\ell = 1 \mid \wedge_{j \in [\ell-1]} X_j = 1] \le \Pr[\hat{X}_\ell = 1]. \tag{31}$$

Letting $\Delta(a,b)$ be the number of coordinates $j$ for which $a_j \ne b_j$, we have

$$\Pr[\hat{X}_\ell = 1] = 1 - \frac{E[\Delta(a,b)]}{d}.$$

If $\wedge_{j \in [\ell-1]} X_j = 1$ occurs, then the first $\ell - 1$ coordinates of $a$ and $b$ have the same sign, and so

$$\Pr[X_\ell = 1 \mid \wedge_{j \in [\ell-1]} X_j = 1] = 1 - \frac{E[\Delta(a,b) \mid \wedge_{j \in [\ell-1]} X_j = 1]}{d - \ell + 1} = 1 - \frac{E[\Delta(a,b)]}{d - \ell + 1} \le 1 - \frac{E[\Delta(a,b)]}{d},$$

which proves (31).

We will apply Fact 7.11 to bound $\Pr[a^\top b > r]$ for $r = \frac{1}{4} - 6D$. Since $X = \frac{d}{2}a^\top b + \frac{d}{2} = \frac{d}{2}(1 + a^\top b)$, we have

$$\frac{X - E[X]}{E[X]} = \frac{a^\top b - E[a^\top b]}{1 + E[a^\top b]},$$

where we have used that (30) implies $E[X]$ is positive (for large enough $d$), so we can perform the division. So

$$\frac{X - E[X]}{E[X]} - \frac{r - E[a^\top b]}{1 + E[a^\top b]} = \frac{a^\top b - E[a^\top b]}{1 + E[a^\top b]} - \frac{r - E[a^\top b]}{1 + E[a^\top b]} = \frac{a^\top b - r}{1 + E[a^\top b]},$$

and so

$$\Pr[a^\top b > r] = \Pr\left[\frac{X - E[X]}{E[X]} > \frac{r - E[a^\top b]}{1 + E[a^\top b]}\right].$$

By Fact 7.11, for $\gamma = \frac{r - E[a^\top b]}{1 + E[a^\top b]}$, we have for $\gamma > 0$,

$$\Pr[a^\top b > r] \le \left[\frac{e^\gamma}{(1+\gamma)^{1+\gamma}}\right]^{d(1 + E[a^\top b])/2}.$$

By (30), $r - E[a^\top b] = 6D$, and $1 \le 1 + E[a^\top b] \le 2$, so $\gamma \in [3D, 6D]$. It is well-known (see Theorem 4.3 of [MR95]) that for $0 < \gamma < 2e - 1$, $e^\gamma \le (1+\gamma)^{1+\gamma}e^{-\gamma^2/4}$, and so

$$\left[\frac{e^\gamma}{(1+\gamma)^{1+\gamma}}\right]^{d(1 + E[a^\top b])/2} \le \exp\left(-\frac{\gamma^2}{4}\left(d(1 + E[a^\top b])/2\right)\right) = \exp\left(-\gamma^2 d(1 + E[a^\top b])/8\right).$$

Since $\gamma \ge 3D$ and $E[a^\top b] \ge 0$, this is at most $\exp(-D^2 d) \le \exp(-(\ln n)^2) \le n^{-3}$, for large enough $n$, using the definition of $D$.

For the second tail estimate, we can apply the same argument to $-a$ and $b$, proving that $\Pr[-a^\top b > r] \le 1/n^3$, where $r = -1/4 + 18D$. We let $X_j$ be the $\{0,1\}$ variables $\frac{d}{2}\left(-a_j b_j + \frac{1}{d}\right)$, with expected sum $E[X] = 3d/8 + 6dD$. As above, $\Pr[-a^\top b > r] = \Pr\left[\frac{X - E[X]}{E[X]} > \gamma\right]$, where $\gamma = \frac{r + E[a^\top b]}{1 - E[a^\top b]} = \frac{6D}{3/4 + 12D}$.

Now $\gamma\sqrt{d}$ is between $6\ln n$ and $8\ln n$, so the same relations apply as above, and the second tail estimate follows. □
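The two numerical facts used at the end of the proof can be spot-checked directly. The sketch below checks the inequality $e^\gamma \le (1+\gamma)^{1+\gamma}e^{-\gamma^2/4}$ only on a grid of small $\gamma$ (the regime $\gamma = O(D)$ used here), and checks $\exp(-(\ln n)^2) \le n^{-3}$, which holds once $\ln n \ge 3$. The function names are illustrative, and the identity $D^2 d = (\ln n)^2$ assumes $D = \frac{\ln n}{\sqrt{d}}$.

```python
import math

def chernoff_base(gamma):
    """The base e^gamma / (1 + gamma)^(1 + gamma) of the tail bound."""
    return math.exp(gamma) / (1.0 + gamma) ** (1.0 + gamma)

def simplified_base(gamma):
    """The simpler upper bound exp(-gamma^2 / 4)."""
    return math.exp(-gamma * gamma / 4.0)

def final_bound(n):
    """exp(-D^2 d) under the assumption D = ln(n) / sqrt(d)."""
    return math.exp(-math.log(n) ** 2)
```

The threshold for the second check is $n \ge e^3 \approx 20.1$, which is one way to read "for large enough $n$" in the proof.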



Note that since by (29) all regular points are distance $\sqrt{3}/2$ from $c$, that distance is an upper bound for the MEB radius of a collection of regular points.

The next lemmas give more properties of MEBs involving regular and special points, under the assumption that the above concentration bounds on $a^\top b$ hold for a given special point $b$ and all $a$ in a collection of regular points.

That is, let $S$ be a collection of random regular points. Let $\mathcal{E}$ be the event that for all $a \in S$, $-18D \le a^\top b - \frac{1}{4} \le -6D$. By Lemma 7.10 and a union bound, $\Pr[\mathcal{E}] \ge 1 - \frac{2}{n^2}$ when $S$ has at most $n$ points.

The condition of event $\mathcal{E}$ applies not only to every point in $S$, but to every point in the convex hull conv $S$.

Lemma 7.12. For special point $b$ and collection $S$ of points $a$, if event $\mathcal{E}$ holds, then for every $a_S \in \mathrm{conv}\, S$, $-18D \le a_S^\top b - \frac{1}{4} \le -6D$.

Proof. Since $a_S \in \mathrm{conv}\, S$, we have $a_S = \sum_{a \in S} p_a a$ for some values $p_a$ with $\sum_{a \in S} p_a = 1$ and $p_a \ge 0$ for all $a \in S$. Therefore, assuming $\mathcal{E}$ holds,

$$a_S^\top b = \sum_{a \in S} p_a a^\top b \le \sum_{a \in S} p_a(1/4 - 6D) = 1/4 - 6D,$$

and similarly $a_S^\top b \ge 1/4 - 18D$. □

Lemma 7.13. Suppose $b$ is a special point and $S$ is a collection of regular points such that event $\mathcal{E}$ holds. Then for any $a_S \in \mathrm{conv}\, S$, $\|a_S - b\| \ge \frac{\sqrt{3}}{2} + 6D$. Since Center($S$) $\in \mathrm{conv}\, S$, this bound applies to $\|\mathrm{Center}(S) - b\|$ as well.

Proof. Let $H$ be the hyperplane normal to $c = 1_d/(2\sqrt{d})$ and containing $c$. Then $S \subset H$, and so $\mathrm{conv}\, S \subset H$, and since the minimum norm point in $H$ is $c$, all points $a_S \in \mathrm{conv}\, S$ have $\|a_S\|^2 \ge 1/4$. By the assumption that event $\mathcal{E}$ holds, and the previous lemma, we have $a_S^\top b \le \frac{1}{4} - 6D$.

Using this fact, $\|b\| = 1$, and $\|a_S\|^2 \ge 1/4$, we have

$$\|a_S - b\|^2 = \|a_S\|^2 + \|b\|^2 - 2a_S^\top b \ge \frac{1}{4} + 1 - 2\left(\frac{1}{4} - 6D\right) = \frac{3}{4} + 12D,$$

and so $\|a_S - b\| \ge \frac{\sqrt{3}}{2} + 6D$ provided $D$ is smaller than a small constant. □

Lemma 7.14. Suppose $a$ is a regular point, $b$ is a special point, and $a^\top b \ge \frac{1}{4} - 18D$. Then there is a point $q \in \mathbb{R}^d$ for which $\|q - b\| = \frac{\sqrt{3}}{2}$ and $\|q - a\| \le \frac{\sqrt{3}}{2} + \Theta(D^2)$, as $D \to 0$.



Proof. As usual let $c = 1_d/(2\sqrt{d})$ and consider the point $q$ at distance $\frac{\sqrt{3}}{2}$ from $b$ on the line segment $cb$, so

$$q = c + \gamma \cdot \frac{b - c}{\|b - c\|} = c + \gamma\alpha(b - c),$$

where $\alpha = 1/\|b - c\|$ and $\gamma$ is a value in $\Theta(D)$. From the definition of $q$,

$$\|q - a\|^2 = \|q\|^2 + \|a\|^2 - 2a^\top q = \|c\|^2 + 2\gamma\alpha\, b^\top c - 2\gamma\alpha\|c\|^2 + \gamma^2 + \|a\|^2 - 2a^\top c - 2\gamma\alpha\, a^\top b + 2\gamma\alpha\, a^\top c.$$

Recalling from (27) that $\|a\| = 1$, from (28) that $a^\top c = \|c\|^2 = \frac{1}{4}$, from (30) that $b^\top c = 1/4 - 12D$, and using the assumption $a^\top b \ge 1/4 - 18D$, we have

$$\|q - a\|^2 \le 1/4 + 2\gamma\alpha(1/4 - 12D) - 2\gamma\alpha(1/4) + \gamma^2 + 1 - 2(1/4) - 2\gamma\alpha(1/4 - 18D) + 2\gamma\alpha(1/4) = 3/4 + 12\gamma\alpha D + \gamma^2 \le 3/4 + \Theta(D^2),$$

where the last inequality uses $\gamma = \Theta(D)$ and $\alpha = \Theta(1)$. □
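The construction of $q$ in this proof can be traced on explicit vectors. The sketch below is an illustration with assumed helper names; it computes $q$, $\gamma$ and $\alpha$ from $b$ and $c$, so that the distance to $b$ and the bound $\frac{3}{4} + 12\gamma\alpha D + \gamma^2$ on $\|q - a\|^2$ can be evaluated for concrete points.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def point_q(b, c):
    """Point on the segment from c to b at distance sqrt(3)/2 from b."""
    bc = [x - y for x, y in zip(b, c)]
    length = math.sqrt(dot(bc, bc))
    alpha = 1.0 / length
    # gamma is the distance travelled from c towards b.
    gamma = length - math.sqrt(3.0) / 2.0
    q = [y + gamma * alpha * w for y, w in zip(c, bc)]
    return q, gamma, alpha
```

With $d = 64$ and $12dD = 8$ (so $D = 1/96$), one can place the signs of $b$ so that $a^\top b = 1/8$, which satisfies the hypothesis $a^\top b \ge \frac{1}{4} - 18D$; in that case $\|b - c\| = 1$ and $\|q - a\|^2 = \frac{3}{4} + \gamma^2$, below the stated bound.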

7.3.3 Main Theorem

Given an $n \times d$ matrix $A$ together with the norms $\|A_i\|$ for all rows $A_i$, as well as the promise that all $\|A_i\| = O(1)$, the $\epsilon$-MEB-Coreset problem is to output a subset $S$ of $O(\epsilon^{-1})$ rows of $A$ for which $A_i \in (1+\epsilon) \cdot \mathrm{MEB}(S)$. Our main theorem in this section is the following.

Theorem 7.15. If $n\epsilon^{-1} \ge d$ and $d = \Omega(\epsilon^{-2})$, then any randomized algorithm which with probability $\ge 4/5$ solves $\epsilon$-MEB-Coreset must read $\Omega(n\epsilon^{-2})$ entries of $A$ for some choice of its random coins.

We also define the following problem. Given an $n \times d$ matrix $A$ together with the norms $\|A_i\|$ for all rows $A_i$, as well as the promise that all $\|A_i\| = O(1)$, the $\epsilon$-MEB-Center problem is to output a vector $x \in \mathbb{R}^d$ for which $\|A_i - x\| \le (1+\epsilon)\min_{y \in \mathbb{R}^d}\max_{j \in [n]}\|y - A_j\|$. We also show the following.

Theorem 7.16. If $n\epsilon^{-2} \ge d$ and $d = \Omega(\epsilon^{-2})$, then any randomized algorithm which with probability $\ge 4/5$ solves $\epsilon$-MEB-Center by outputting a convex combination of the rows $A_i$ must read $\Omega(n\epsilon^{-2})$ entries of $A$ for some choice of its random coins.

These theorems will follow from the same hardness construction, which we now describe. Put $d = 8\epsilon^{-2} \ln^2 n$, which we assume is a sufficiently large power of 2. We also assume $n$ is even. We construct two families $\mathcal{F}$ and $\mathcal{G}$ of $n \times d$ matrices $A$.

The family $\mathcal{F}$ consists of all $A$ for which each of the $n$ rows in $A$ is a regular point.

The family $\mathcal{G}$ consists of all $A$ for which exactly $n - 1$ rows of $A$ are regular points, and one row of $A$ is a special point.

(Recall that we say that a vertex on $\mathcal{H}_d$ is regular if it has exactly $\frac{3d}{4}$ coordinates equal to $\frac{1}{\sqrt{d}}$. We say a point on $\mathcal{H}_d$ is special if it has exactly $d\left(\frac{3}{4} - 12D\right)$ coordinates equal to $\frac{1}{\sqrt{d}}$.)
Let $\mu$ be the distribution on $n \times d$ matrices for which half of its mass is uniformly distributed on matrices in $\mathcal{F}$, while the remaining half is uniformly distributed on the matrices in $\mathcal{G}$. Let $A \sim \mu$. We show that any randomized algorithm Alg which decides whether $A \in \mathcal{F}$ or $A \in \mathcal{G}$ with probability at least 3/4 must read $\Omega(nd)$ entries of $A$ for some choice of its random coins. W.l.o.g., we may assume that Alg is deterministic, since we may fix the coin tosses that lead to the largest success probability (over the choice of $A$). By symmetry and independence of the rows, we can assume that in each row, Alg queries entries in order, that is, if Alg makes $s$ queries to a row $A_i$, we can assume it queries $A_{i,1}, A_{i,2}, \ldots, A_{i,s}$, in that order.

Let $r = d/(C \ln^2 n)$ for a sufficiently large constant $C > 0$. For a vector $u \in \mathbb{R}^d$ let $\mathrm{pref}(u)$ denote its first $r$ coordinates. Let $p$ be the distribution of $\mathrm{pref}(u)$ for a random regular point $u$. Let $p'$ be the distribution of $\mathrm{pref}(u)$ for a random special point $u$.

Lemma 7.17. (Statistical Difference Lemma) For $C > 0$ a sufficiently large constant,

\[
\|p - p'\|_1 \le \frac{1}{10}.
\]

Proof. We will apply the following fact twice, once to $p$ and once to $p'$.

Fact 7.18. (special case of Theorem 4 of [DF80]) Suppose an urn $U$ contains $d$ balls, each marked by one of two colors. Let $H_{U,r}$ be the distribution of $r$ draws made at random without replacement from $U$, and $M_{U,r}$ be the distribution of $r$ draws made at random with replacement. Then,

\[
\|H_{U,r} - M_{U,r}\|_1 \le \frac{4r}{d}.
\]

Let $\sigma$ be the distribution with support $\{\frac{1}{\sqrt{d}}, -\frac{1}{\sqrt{d}}\}$ with $\sigma(\frac{1}{\sqrt{d}}) = \frac{3}{4}$ and $\sigma(-\frac{1}{\sqrt{d}}) = \frac{1}{4}$. Let $\tau$ be the distribution with support $\{\frac{1}{\sqrt{d}}, -\frac{1}{\sqrt{d}}\}$ with $\tau(\frac{1}{\sqrt{d}}) = \frac{3}{4} - 12D$ and $\tau(-\frac{1}{\sqrt{d}}) = \frac{1}{4} + 12D$.

Let $\sigma^r$ be the joint distribution of $r$ independent samples from $\sigma$, and similarly define $\tau^r$.

Applying Fact 7.18 with $r = 1/(100D^2)$,

\[
\|p - \sigma^r\|_1 \le \frac{1}{25dD^2}
\quad\text{and}\quad
\|p' - \tau^r\|_1 \le \frac{1}{25dD^2}.
\]

By the triangle inequality,

\[
\|p - p'\|_1 \le \|p - \sigma^r\|_1 + \|\sigma^r - \tau^r\|_1 + \|\tau^r - p'\|_1 \le \|\sigma^r - \tau^r\|_1 + \frac{2}{25dD^2},
\]

and so it remains to bound $\|\sigma^r - \tau^r\|_1$. To do this, we use Stein's Lemma (see, e.g., ??, Section 12.8), which shows that for two coins with bias in $[\Omega(1), 1 - \Omega(1)]$, one needs $\Theta(z^{-2})$ independent coin tosses to distinguish the distributions with constant probability, where $z$ is the difference in their expectations. Here, $z = 12D$, and so for constant $C > 0$ sufficiently large, for $r = 1/(CD^2)$, it follows that $\|\sigma^r - \tau^r\|_1 \le \frac{1}{20}$. We thus have

\[
\|p - p'\|_1 \le \frac{1}{20} + \frac{2}{25dD^2} \le \frac{1}{10},
\]

where the last inequality uses $dD^2 = (\ln n)^2 \to \infty$. □
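The urn bound of Fact 7.18 can be checked exactly on small instances. The sketch below is illustrative only: the function name and the parameter values are assumptions, not part of the paper. It uses the observation that both draw distributions are exchangeable, so the $\ell_1$ distance between sequences of draws equals the $\ell_1$ distance between the hypergeometric and binomial counts of marked balls.

```python
from math import comb

def l1_with_vs_without_replacement(d, k, r):
    """Exact L1 distance between r draws without and with replacement
    from an urn of d balls, k of which carry the first color."""
    total = 0.0
    for j in range(r + 1):
        # Without replacement: hypergeometric count of first-color balls.
        hyper = comb(k, j) * comb(d - k, r - j) / comb(d, r)
        # With replacement: binomial count of first-color balls.
        binom = comb(r, j) * (k / d) ** j * (1 - k / d) ** (r - j)
        total += abs(hyper - binom)
    return total

# Hypothetical parameters: d = 400 balls, 3/4 of them of the first color, r = 20 draws.
distance = l1_with_vs_without_replacement(400, 300, 20)
bound = 4 * 20 / 400  # the 4r/d bound of Fact 7.18
```

For these values the computed distance sits well below the $4r/d$ bound, which is what the proof needs when $r$ is much smaller than $d$.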
We use Lemma 7.17 to prove the following. We assume that Alg outputs 1 if it decides that $A \in \mathcal{F}$, otherwise it outputs 0.

Theorem 7.19. If Alg queries $o(nr)$ entries of $A$, it cannot decide if $A \in \mathcal{F}$ with probability at least 3/4.

Proof. We can think of $A$ as being generated according to the following random process.

1. Choose an index $i^* \in [n]$ uniformly at random.

2. Choose rows $A_j$ for $j \in [n]$ to be random independent regular points.

3. With probability 1/2, do nothing. Otherwise, with the remaining probability 1/2, replace $A_{i^*}$ with a random special point.

4. Output $A$.
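The four steps above can be sketched as a sampler. This is a minimal illustration under stated assumptions: a regular point is taken to have $3d/4$ coordinates equal to $1/\sqrt{d}$ and a special point $d(3/4 - 12D)$ such coordinates, with the remaining coordinates equal to $-1/\sqrt{d}$. The function names and the test parameters are hypothetical.

```python
import math
import random

def random_sign_point(d, num_positive, rng):
    """Random vector with num_positive entries +1/sqrt(d) and the rest -1/sqrt(d)."""
    scale = 1.0 / math.sqrt(d)
    point = [scale] * num_positive + [-scale] * (d - num_positive)
    rng.shuffle(point)
    return point

def sample_hard_instance(n, d, D, rng):
    """Follow steps 1-4 of the random process; returns (A, i_star, planted)."""
    regular_positives = (3 * d) // 4                # assumed definition of a regular point
    special_positives = round(d * (0.75 - 12 * D))  # assumed definition of a special point
    i_star = rng.randrange(n)                       # step 1
    A = [random_sign_point(d, regular_positives, rng) for _ in range(n)]  # step 2
    planted = rng.random() < 0.5                    # step 3
    if planted:
        A[i_star] = random_sign_point(d, special_positives, rng)
    return A, i_star, planted                       # step 4
```

Every row has unit norm in either case, so the row norms given to the algorithm carry no information about which family the matrix came from.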

Define the advantage $\operatorname{adv}(\mathrm{Alg})$ to be:

\[
\operatorname{adv}(\mathrm{Alg}) = \left| \Pr_{A \in \mathcal{F}}[\mathrm{Alg}(A) = 1] - \Pr_{A \in \mathcal{G}}[\mathrm{Alg}(A) = 1] \right|.
\]

To prove the theorem, it suffices to show $\operatorname{adv}(\mathrm{Alg}) < 1/4$. Let $A_{-i^*}$ denote the rows of $A$, excluding row $i^*$, generated in step 2. By the description of the random process above, we have

\[
\operatorname{adv}(\mathrm{Alg}) = \left| \mathbf{E}_{i^*, A_{-i^*}} \left[ \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid i^*, A_{-i^*}] \right] \right|.
\]

To analyze this quantity, we first condition on a certain event $\mathcal{E}(i^*, A_{-i^*})$ holding, which will occur with probability $1 - o(1)$, and allow us to discard the pairs $(i^*, A_{-i^*})$ that do not satisfy the condition of the event. Intuitively, the event is just that for most regular $A_{i^*}$, algorithm Alg does not read more than $r$ entries in $A_{i^*}$. This holds with probability $1 - o(1)$, over the choice of $i^*$ and $A_{-i^*}$, because all $n$ rows of $A$ are i.i.d., and so on average Alg can only afford to read $o(r)$ entries in each row.

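More formally, we say a pair $(i, A_{-i})$ is good if

\[
\Pr_{\text{regular } A_{i^*}}[\text{Alg makes at most } r \text{ queries to } A_{i^*} \mid (i, A_{-i}) = (i^*, A_{-i^*})] = 1 - o(1).
\]

Let $\mathcal{E}(i^*, A_{-i^*})$ be the event that $(i^*, A_{-i^*})$ is good. Then $\Pr_{i^*, A_{-i^*}}[\mathcal{E}(i^*, A_{-i^*})] = 1 - o(1)$, and we can upper bound the advantage by

\[
\mathbf{E}_{i^*, A_{-i^*}} \left| \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] \right| + o(1).
\]

Consider the algorithm $\mathrm{Alg}'_{i^*}$, which on input $A$, makes the same sequence of queries to $A$ as Alg unless it must query more than $r$ positions of $A_{i^*}$. In this case, it outputs an arbitrary value in $\{0, 1\}$, otherwise it outputs $\mathrm{Alg}(A)$.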

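Claim 7.20.

\[
\left| \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| = o(1).
\]

Proof. Since $\mathcal{E}(i^*, A_{-i^*})$ occurs,

\[
\Pr_{\text{regular } A_{i^*}}[\text{Alg makes at most } r \text{ queries to } A_{i^*} \mid i^*, A_{-i^*}] = 1 - o(1).
\]

This implies that

\[
\left| \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| = o(1).
\]

□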



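By Lemma 7.17, we have that

\[
\left| \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] - \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| \le \frac{1}{10}.
\]

Hence, by Claim 7.20 and the triangle inequality, we have that

\[
\left| \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| \le \frac{1}{10} + o(1).
\]

To finish the proof, it suffices to show the following claim.

Claim 7.21.

\[
\left| \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| \le \frac{1}{10} + o(1).
\]

Indeed, if we show Claim 7.21, then by the triangle inequality we will have that $\operatorname{adv}(\mathrm{Alg}) \le \frac{1}{5} + o(1) < \frac{1}{4}$.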

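Proof of Claim 7.21. Since $\mathcal{E}(i^*, A_{-i^*})$ occurs,

\[
\Pr_{\text{regular } A_{i^*}}[\text{Alg makes at most } r \text{ queries to } A_{i^*} \mid i^*, A_{-i^*}] = 1 - o(1).
\]

Since $p$ is the distribution of prefixes of regular points, this condition can be rewritten as

\[
\Pr_{u \sim p}[\text{Alg makes at most } r \text{ queries to the } i^*\text{-th row} \mid i^*, A_{-i^*}, \mathrm{pref}(A_{i^*}) = u] = 1 - o(1).
\]

By Lemma 7.17 we thus have

\[
\Pr_{u \sim p'}[\text{Alg makes at most } r \text{ queries to the } i^*\text{-th row} \mid i^*, A_{-i^*}, \mathrm{pref}(A_{i^*}) = u] \ge \frac{9}{10} - o(1).
\]

Since $p'$ is the distribution of prefixes of special points, this condition can be rewritten as

\[
\Pr_{\text{special } A_{i^*}}[\text{Alg makes at most } r \text{ queries to } A_{i^*} \mid i^*, A_{-i^*}] \ge \frac{9}{10} - o(1).
\]

This implies that

\[
\left| \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid \mathcal{E}(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] \right| \le \frac{1}{10} + o(1).
\]

□

This completes the proof of the theorem. □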
7.3.4 Proofs of Theorems 7.15 and 7.16

Next we show how Theorem 7.19 implies Theorem 7.15 and Theorem 7.16, using the results on MEBs of regular and special points.

Proof of Theorem 7.15. We set the dimension $d = 4 \cdot 36 \cdot \epsilon^{-2} \ln(n - 1)$. Let $A'$ denote the set of regular rows of $A$. We condition on event $\mathcal{E}$, namely, that every convex combination $p^T A'$, where $p \in \Delta_{n-1}$, satisfies $p^T A' b \le \frac{1}{4} - 6D$. This event occurs with probability at least $1 - 2n^{-2}$. (We may neglect the difference between $n$ and $n - 1$ in some expressions.)

It follows by Lemma 7.13 that if $A \in \mathcal{G}$, then for every $S \subset A'$,

\[
\|\mathrm{Center}(S) - b\| \ge \frac{\sqrt{3}}{2} + 2\epsilon.
\]

By (29), $\mathrm{Radius}(A') \le \frac{\sqrt{3}}{2}$. It follows that any algorithm that, with probability at least 4/5, outputs a subset $S$ of $O(\epsilon^{-1})$ rows of $A$ for which $A_i \in (1 + \epsilon) \cdot \mathrm{MEB}(S)$ must include the point $b \in S$.

Given such an algorithm, by reading each of the $O(\epsilon^{-1})$ rows output, we can determine if $A \in \mathcal{F}$ or $A \in \mathcal{G}$ with an additional $O(\epsilon^{-1} d)$ time. By Theorem 7.19, the total time must be $\Omega(n\epsilon^{-2})$. By assumption, $n\epsilon^{-1} \ge d$, and so any randomized algorithm that solves $\epsilon$-MEB-Coreset with probability at least 4/5 can decide if $A \in \mathcal{F}$ with probability at least $4/5 - 2n^{-2} \ge 3/4$, and so it must read $\Omega(n\epsilon^{-2})$ entries for some choice of its random coins. □



Proof of Theorem 7.16. We again set the dimension $d = 4 \cdot 36 \cdot \epsilon^{-2} \ln(n - 1)$. Let $A'$ denote the set of regular rows of $A$. We again condition on the event $\mathcal{E}$. By Lemma 7.13, if $A \in \mathcal{G}$, then for every convex combination $p^T A'$,

\[
\|p^T A' - b\| \ge \frac{\sqrt{3}}{2} + 2\epsilon,
\]

and so the MEB radius returned by any algorithm that outputs a convex combination of rows of $A'$ must be at least $\frac{\sqrt{3}}{2} + 2\epsilon$. However, by (29), if $A \in \mathcal{F}$, then $\mathrm{Radius}(A) \le \frac{\sqrt{3}}{2}$. On the other hand, by Lemma 7.14, if $A \in \mathcal{G}$, then $\mathrm{MEB\text{-}radius}(A) \le \frac{\sqrt{3}}{2} + \Theta(\epsilon^2)$.

It follows that if $A \in \mathcal{G}$, then the convex combination $p^T A$ output by the algorithm must have a non-zero coefficient multiplying the special point $b$. This, in particular, implies that $p^T A$ is not on the affine hyperplane $H$ with normal vector $1_d$ containing the point $c = 1_d/(2\sqrt{d})$. However, if $A \in \mathcal{F}$, then any convex combination of the points is on $H$. The output $p^T A$ of the algorithm is on $H$ if and only if $p^T A 1_d = \frac{\sqrt{d}}{2}$, which can be tested in $O(d)$ time.

By Theorem 7.19, the total time must be $\Omega(n\epsilon^{-2})$. By assumption, $n\epsilon^{-2} \ge d$, and so any randomized algorithm that solves $\epsilon$-MEB-Center with probability $\ge 4/5$ by outputting a convex combination of rows can decide if $A \in \mathcal{F}$ with probability at least $4/5 - 2n^{-2} \ge 3/4$, and so must read $\Omega(n\epsilon^{-2})$ entries for some choice of its random coins. □



7.4 Las Vegas Algorithms 

While our algorithms are Monte Carlo, meaning they err with small probability, it may be desirable 
to obtain Las Vegas algorithms, i.e., randomized algorithms that have low expected time but never 
err. We show this cannot be done in sublinear time. 
Theorem 7.22. For the classification and minimum enclosing ball problems, there is no Las Vegas algorithm that reads an expected $o(M)$ entries of its input matrix and solves the problem to within a one-sided additive error of at most 1/2. This holds even if $\|A_i\| = 1$ for all rows $A_i$.

Proof. Suppose first that $n \ge M$. Consider $n \times d$ matrices $A, B^1, \ldots, B^M$, where for each $C \in \{A, B^1, \ldots, B^M\}$, $C_{i,j} = 0$ if either $j > 1$ or $i > M$. Also, $A_{i,1} = 1$ for $i \in [M]$, while for each $j$, $B^j_{i,1} = 1$ if $i \in [M] \setminus \{j\}$, while $B^j_{j,1} = -1$. With probability 1/2 the matrix $A$ is chosen, otherwise a matrix $B^j$ is chosen for a random $j$. Notice that whichever case we are in, each of the first $M$ rows of the input matrix has norm equal to 1, while all remaining rows have norm 0. It is easy to see that distinguishing these two cases with probability $\ge 2/3$ requires reading $\Omega(M)$ entries. As $\Omega(M)$ is a lower bound for Monte Carlo algorithms, it is also a lower bound for Las Vegas algorithms. Moreover, distinguishing these two cases is necessary, since if the problem is classification, if $C = A$ the margin is 1, otherwise it is 0, while if the problem is minimum enclosing ball, if $C = A$ the cost is 0, otherwise it is 1.

We now assume $M > n$. Let $d'$ be the largest integer for which $nd' \le M$. Here $d' \ge 1$. Let $A$ be the $n \times d'$ matrix, where $A_{i,j} = \frac{1}{\sqrt{d'}}$ for all $i$ and $j$. The margin of $A$ is 1, and the minimum enclosing ball has radius 0.

Suppose there were an algorithm Alg on input $A$ for which there is an assignment to Alg's random tape $r$ for which Alg reads at most $nd'/4$ of its entries. If there were no such $r$, the expected running time of Alg is already $\Omega(nd') = \Omega(M)$. Let $A_\ell$ be a row of $A$ for which Alg reads at most $d'/4$ entries of $A_\ell$ given random tape $r$, and let $S \subset [d']$ be the set of indices in $A_\ell$ read, where $|S| \le d'/4$. Consider the $n \times d'$ matrix $B$ for which $B_{i,j} = A_{i,j}$ for all $i \ne \ell$, while $B_{\ell,j} = A_{\ell,j}$ for all $j \in S$, and $B_{\ell,j} = -A_{\ell,j}$ for all $j \in [d'] \setminus S$. Notice that all rows of $A$ and $B$ have norm 1.

To bound the margin of $B$, consider any vector $x$ of norm at most 1. Then

\[
(A_\ell + B_\ell)x \le \|x\| \cdot \|A_\ell + B_\ell\| \le \|A_\ell + B_\ell\|.
\]

$A_\ell + B_\ell$ has at least $3d'/4$ entries that are 0, while the non-zero entries all have value $2/\sqrt{d'}$. Hence, $\|A_\ell + B_\ell\|^2 \le \frac{d'}{4} \cdot \frac{4}{d'} = 1$. It follows that either $A_\ell x$ or $B_\ell x$ is at most 1/2, which bounds the margin of $B$. As Alg cannot distinguish $A$ and $B$ given random tape $r$, it cannot have one-sided additive error at most 1/2.

For minimum enclosing ball, notice that $\|A_\ell - B_\ell\|^2 \cdot \frac{1}{4} \ge \frac{3d'}{4} \cdot \frac{4}{d'} \cdot \frac{1}{4} = \frac{3}{4}$, which lower bounds the cost of the minimum enclosing ball of $B$. As Alg cannot distinguish $A$ and $B$ given random tape $r$, it cannot have one-sided additive error at most 3/4. □
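The second construction in the proof can be reproduced numerically. The following is a sketch with hypothetical names and sizes ($n = 3$, $d' = 16$, a read set of $d'/4$ indices). It builds $A$ and the matrix $B$ that flips the unread entries of one row, then evaluates the two quantities used in the argument.

```python
import math

def las_vegas_pair(n, d_prime, ell, read_indices):
    """Build A (all entries 1/sqrt(d')) and B, which flips the unread entries of row ell."""
    value = 1.0 / math.sqrt(d_prime)
    A = [[value] * d_prime for _ in range(n)]
    B = [row[:] for row in A]
    for j in range(d_prime):
        if j not in read_indices:
            B[ell][j] = -A[ell][j]
    return A, B

def squared_norm(v):
    return sum(x * x for x in v)

# Hypothetical instance: row 1 is the lightly read row, and S = {0, 1, 2, 3}.
A, B = las_vegas_pair(n=3, d_prime=16, ell=1, read_indices={0, 1, 2, 3})
row_sum = [a + b for a, b in zip(A[1], B[1])]
row_diff = [a - b for a, b in zip(A[1], B[1])]
# For any x with norm at most 1, min(A_l x, B_l x) is at most this value.
margin_bound = math.sqrt(squared_norm(row_sum)) / 2
# Squared radius of the smallest ball containing both A_l and B_l.
meb_cost_bound = squared_norm(row_diff) / 4
```

With exactly $d'/4$ entries read, the margin bound comes out at 1/2 and the squared radius bound at 3/4, matching the two constants in the proof.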

8 Concluding Remarks 

We have described a general method for sublinear optimization of constrained convex programs, and showed applications to classical problems in machine learning such as linear classification and minimum enclosing ball, obtaining improvements in leading-order terms over the state of the art. The application of our sublinear primal-dual algorithms to soft margin SVM and related convex problems is currently being explored in ongoing work with Nati Srebro.

In all our running times the dimension $d$ can be replaced by the parameter $S$, which is the maximum over the input rows $A_i$ of the number of nonzero entries in $A_i$. Note that $d \ge S \ge M/n$. Here we require the assumption that entries of any given row can be recovered in $O(S)$ time, which is compatible with keeping each row as a hash table or (up to a logarithmic factor in run-time) in sorted order.
A Main Tools

A.1 Tools from online learning

Online linear optimization. The following lemma is essentially due to Zinkevich [Zin03]:

Lemma A.1 (OGD). Consider a set of vectors $q_1, \ldots, q_T \in \mathbb{R}^d$ such that $\|q_i\|_2 \le c$. Let $x_0 \leftarrow 0$, and $y_{t+1} \leftarrow x_t + \frac{1}{\sqrt{T}} q_t$, $x_{t+1} \leftarrow \frac{y_{t+1}}{\max\{1, \|y_{t+1}\|\}}$. Then

\[
\max_{x \in \mathbb{B}} \sum_{t=1}^{T} q_t^T x - \sum_{t=1}^{T} q_t^T x_t \le 2c\sqrt{T}.
\]

This is true even if each $q_t$ is dependent on $x_1, \ldots, x_{t-1}$.

Proof. Assume $c = 1$; the generalization is by straightforward scaling. Let $\eta = \frac{1}{\sqrt{T}}$. By definition and for any $\|x\| \le 1$,

\[
\|x - x_{t+1}\|^2 \le \|x - y_{t+1}\|^2 = \|x - x_t - \eta q_t\|^2 = \|x - x_t\|^2 - 2\eta q_t^T(x - x_t) + \eta^2\|q_t\|^2.
\]

Rearranging we obtain

\[
q_t^T(x - x_t) \le \frac{1}{2\eta}\left[\|x - x_t\|^2 - \|x - x_{t+1}\|^2\right] + \eta/2.
\]

Summing up over $t = 1$ to $T$ yields

\[
\sum_t q_t^T x - \sum_t q_t^T x_t \le \frac{1}{2\eta}\|x - x_1\|^2 + \eta T/2 \le 2\sqrt{T}.
\]

□
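The update of Lemma A.1 can be sketched in a few lines. This is an illustrative implementation for the case $c = 1$, started from the origin; the function names are hypothetical. It returns the regret against the best fixed point of the unit ball, which for linear gains is the norm of the summed gain vectors.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ogd_regret(qs):
    """Run projected online gradient ascent on the gain vectors qs (norms at most 1)
    and return the regret against the best fixed point in the unit ball."""
    T, d = len(qs), len(qs[0])
    eta = 1.0 / math.sqrt(T)
    x = [0.0] * d
    gained = 0.0
    total = [0.0] * d
    for q in qs:
        gained += dot(q, x)                          # play x_t, then observe q_t
        total = [s + qi for s, qi in zip(total, q)]
        y = [xi + eta * qi for xi, qi in zip(x, q)]  # gradient step
        scale = max(1.0, math.sqrt(dot(y, y)))       # projection onto the unit ball
        x = [yi / scale for yi in y]
    best = math.sqrt(dot(total, total))              # max over the ball of sum_t q_t^T x
    return best - gained
```

Each step touches only the current iterate and the current gain vector, so the per-round cost is $O(d)$.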



For our streaming and parallel implementation, a simpler version of gradient descent, also essentially due to Zinkevich [Zin03], is given by:

Lemma A.2 (Lazy Projection OGD). Consider a set of vectors $q_1, \ldots, q_T \in \mathbb{R}^d$ such that $\|q_i\|_2 \le 1$. Let

\[
x_{t+1} \leftarrow \arg\min_{x \in \mathbb{B}} \left\{ \sum_{\tau=1}^{t} -q_\tau^T x + \sqrt{2T}\,\|x\|_2^2 \right\}.
\]

Then

\[
\max_{x \in \mathbb{B}} \sum_{t=1}^{T} q_t^T x - \sum_{t=1}^{T} q_t^T x_t \le 2\sqrt{2T}.
\]

This is true even if each $q_t$ is dependent on $x_1, \ldots, x_{t-1}$.

For a proof see Theorem 2.1 in [Haz10], where we take $\mathcal{R}(x) = \|x\|_2^2$, and the norm of the linear cost functions is bounded by $\|q_t\|_2 \le 1$, as is the diameter of $\mathcal{K}$, the ball in our case. Notice that the solution of the above optimization problem is simply:

\[
x_{t+1} = \frac{y_{t+1}}{\max\{1, \|y_{t+1}\|\}}, \qquad y_{t+1} = \frac{\sum_{\tau=1}^{t} q_\tau}{\sqrt{2T}}.
\]
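The closed form above means the lazy variant only needs the running sum of the gain vectors, which is what makes it convenient in a streaming setting. A minimal sketch follows; the function name is hypothetical and gains of norm at most 1 are assumed.

```python
import math
import random

def lazy_ogd_regret(qs):
    """Lazy projection OGD: each iterate is the running sum of past gain vectors,
    scaled by 1/sqrt(2T) and projected onto the unit ball."""
    T, d = len(qs), len(qs[0])
    running_sum = [0.0] * d
    gained = 0.0
    for q in qs:
        y = [s / math.sqrt(2 * T) for s in running_sum]
        scale = max(1.0, math.sqrt(sum(v * v for v in y)))
        x = [v / scale for v in y]                   # iterate played in this round
        gained += sum(a * b for a, b in zip(q, x))
        running_sum = [s + qi for s, qi in zip(running_sum, q)]
    best = math.sqrt(sum(v * v for v in running_sum))  # best fixed point in hindsight
    return best - gained
```

Unlike the update of Lemma A.1, no iterate depends on the previous projected iterate, so the sum can be accumulated in any order or in parallel.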
Strongly convex loss functions. The following Lemma is essentially due to [HKKA06].

For $H \in \mathbb{R}$ with $H > 0$, a function $f : \mathbb{R}^d \to \mathbb{R}$ is $H$-strongly convex in $\mathbb{B}$ if for all $x \in \mathbb{B}$, all the eigenvalues of $\nabla^2 f(x)$ are at least $H$.

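Lemma A.3 (OGDStrictlyConvex). Consider a set of $H$-strongly convex functions $f_1, \ldots, f_T$ such that the norm of their gradients is bounded over the unit ball $\mathbb{B}$ by $G \ge \max_t \max_{x \in \mathbb{B}} \|\nabla f_t(x)\|$. Let $x_0 \in \mathbb{B}$, and $y_{t+1} \leftarrow x_t - \frac{1}{Ht}\nabla f_t(x_t)$, $x_{t+1} \leftarrow \frac{y_{t+1}}{\max\{1, \|y_{t+1}\|\}}$. Then

\[
\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathbb{B}} \sum_{t=1}^{T} f_t(x) = O\!\left(\frac{G^2}{H} \log T\right).
\]

This is true even if each $f_t$ is dependent on $x_1, \ldots, x_{t-1}$.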

Again, for the MEB application and its relatives it is easier to implement the lazy versions in the streaming model. The following Lemma is the analogous tool we need:

Lemma A.4. Consider a set of $H$-strongly convex functions $f_1, \ldots, f_T$ such that the norm of their gradients is bounded over the unit ball $\mathbb{B}$ by $G \ge \max_t \max_{x \in \mathbb{B}} \|\nabla f_t(x)\|$. Let

\[
x_{t+1} \leftarrow \arg\min_{x \in \mathbb{B}} \left\{ \sum_{\tau=1}^{t} f_\tau(x) \right\}.
\]

Then

\[
\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathbb{B}} \sum_{t=1}^{T} f_t(x) \le \frac{2G^2}{H} \log T.
\]

This is true even if each $f_t$ is dependent on $x_1, \ldots, x_{t-1}$.
Proof. By Lemma 2.3 in [Haz10] we have:

\[
\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathbb{B}} \sum_{t=1}^{T} f_t(x) \le \sum_{t=1}^{T} [f_t(x_t) - f_t(x_{t+1})].
\]

Denote by $\Phi_t(x) = \sum_{\tau=1}^{t} f_\tau(x)$. Then by Taylor expansion at $x_{t+1}$, there exists a $z_t \in [x_{t+1}, x_t]$ for which

\[
\begin{aligned}
\Phi_t(x_t) &= \Phi_t(x_{t+1}) + (x_t - x_{t+1})^T \nabla\Phi_t(x_{t+1}) + \tfrac{1}{2}\|x_t - x_{t+1}\|_{z_t}^2 \\
&\ge \Phi_t(x_{t+1}) + \tfrac{1}{2}\|x_t - x_{t+1}\|_{z_t}^2,
\end{aligned}
\]

using the notation $\|y\|_z^2 = y^T \nabla^2\Phi_t(z)\, y$. The inequality above is true because $x_{t+1}$ is a minimum of $\Phi_t$ over $\mathcal{K}$. Thus,

\[
\begin{aligned}
\|x_t - x_{t+1}\|_{z_t}^2 &\le 2\Phi_t(x_t) - 2\Phi_t(x_{t+1}) \\
&= 2\left(\Phi_{t-1}(x_t) - \Phi_{t-1}(x_{t+1})\right) + 2[f_t(x_t) - f_t(x_{t+1})] \\
&\le 2[f_t(x_t) - f_t(x_{t+1})] && \text{optimality of } x_t \\
&\le 2\nabla f_t(x_t)^T (x_t - x_{t+1}) && \text{convexity of } f_t.
\end{aligned}
\]

By convexity and Cauchy-Schwarz:

\[
\begin{aligned}
f_t(x_t) - f_t(x_{t+1}) &\le \nabla f_t(x_t)^T (x_t - x_{t+1}) \le \|\nabla f_t(x_t)\|_{z_t}^* \|x_t - x_{t+1}\|_{z_t} \\
&\le \|\nabla f_t(x_t)\|_{z_t}^* \sqrt{2\nabla f_t(x_t)^T (x_t - x_{t+1})}.
\end{aligned}
\]

Shifting sides and squaring, we get

\[
f_t(x_t) - f_t(x_{t+1}) \le \nabla f_t(x_t)^T (x_t - x_{t+1}) \le 2\left(\|\nabla f_t(x_t)\|_{z_t}^*\right)^2.
\]

Since the $f_t$ are assumed to be $H$-strongly convex, we have $\|\cdot\|_{z_t} \ge \|\cdot\|_{Ht}$, and hence for the dual norm,

\[
f_t(x_t) - f_t(x_{t+1}) \le 2\left(\|\nabla f_t(x_t)\|_{z_t}^*\right)^2 \le \frac{2\|\nabla f_t(x_t)\|_2^2}{Ht} \le \frac{2G^2}{Ht}.
\]

Summing over all iterations we get

\[
\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathbb{B}} \sum_{t=1}^{T} f_t(x) \le \sum_{t=1}^{T} [f_t(x_t) - f_t(x_{t+1})] \le \sum_{t=1}^{T} \frac{2G^2}{Ht} \le \frac{2G^2}{H} \log T.
\]

□
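As an illustration of the follow-the-leader rule of Lemma A.4, consider the quadratic losses $f_t(x) = \frac{H}{2}\|x - a_t\|^2$ with centers $a_t$ in the unit ball. This family is an assumption chosen for the example because the leader has a closed form: the projected running mean of the centers. The names below are hypothetical.

```python
import math
import random

def ftl_regret(centers, H=1.0):
    """Follow-the-leader on f_t(x) = (H/2) * ||x - a_t||^2 over the unit ball."""
    d = len(centers[0])

    def loss(x, a):
        return 0.5 * H * sum((xi - ai) ** 2 for xi, ai in zip(x, a))

    def project(y):
        scale = max(1.0, math.sqrt(sum(v * v for v in y)))
        return [v / scale for v in y]

    x = [0.0] * d                 # x_1: an arbitrary point of the ball
    running_sum = [0.0] * d
    paid = 0.0
    for t, a in enumerate(centers, start=1):
        paid += loss(x, a)
        running_sum = [s + ai for s, ai in zip(running_sum, a)]
        # The sum of the first t losses equals (tH/2)||x - mean||^2 plus a constant,
        # so its minimizer over the ball is the projected running mean.
        x = project([s / t for s in running_sum])
    best = sum(loss(x, a) for a in centers)  # x is now the best fixed point in hindsight
    return paid - best
```

For centers in the unit ball the gradients satisfy $\|\nabla f_t(x)\| = H\|x - a_t\| \le 2H$, so $G = 2H$ can be used when comparing against the bound of the lemma.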



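Combining sampling and regret minimization

Lemma A.5. Consider a set of $H$-strongly convex functions $f_1, \ldots, f_T$ such that the norm of their gradients is bounded over the unit ball by $G \ge \max_t \max_{x \in \mathbb{B}} \|\nabla f_t(x)\|$. Let

\[
y_{t+1} = \begin{cases} \arg\min_{x \in \mathbb{B}} \left\{ \sum_{\tau=1}^{t} f_\tau(x) \right\} & \text{w.p. } \alpha \\ y_t & \text{o/w} \end{cases}
\]

Then for a fixed $x^*$ we have

\[
\mathbf{E}\left[\sum_{t=1}^{T} f_t(y_t) - \sum_{t=1}^{T} f_t(x^*)\right] \le \frac{1}{\alpha} \cdot \frac{2G^2}{H} \log T.
\]

This is true even if each $f_t$ is dependent on $y_1, \ldots, y_{t-1}$.

Proof. Consider the sequence of functions $\tilde{f}_t$ defined as

\[
\tilde{f}_t = \begin{cases} \frac{f_t}{\alpha} & \text{w.p. } \alpha \\ \mathbf{0} & \text{o/w} \end{cases}
\]

where $\mathbf{0}$ denotes the all-zero function. Then the algorithm from Lemma A.4 applied to the functions $\tilde{f}_t$ is exactly the algorithm we apply above to the functions $f_t$. Notice that the functions $\tilde{f}_t$ are $\frac{H}{\alpha}$-strongly convex, and in addition their gradients are bounded by $\frac{G}{\alpha}$. Hence applying Lemma A.4 we obtain

\[
\mathbf{E}\left[\sum_{t=1}^{T} f_t(y_t) - \sum_{t=1}^{T} f_t(x^*)\right] = \mathbf{E}\left[\sum_{t=1}^{T} \tilde{f}_t(y_t) - \sum_{t=1}^{T} \tilde{f}_t(x^*)\right] \le \frac{2G^2}{\alpha H} \log T.
\]

□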
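B Auxiliary lemmas

First, some simple lemmas about random variables.

Lemma B.1. Let $X$ be a random variable with $|\mathbf{E}[X]| \le C$, and let $\bar{X} = \mathrm{clip}(X, C) = \min\{C, \max\{-C, X\}\}$ for some $C \in \mathbb{R}$. Then

\[
|\mathbf{E}[\bar{X}] - \mathbf{E}[X]| \le \frac{\mathrm{Var}[X]}{C}.
\]

Proof. By direct calculation:

\[
\begin{aligned}
\mathbf{E}[\bar{X}] - \mathbf{E}[X] &= \int_{x < -C} \Pr[x](-C - x) + \int_{x > C} \Pr[x](C - x) \\
&\le \int_{x < -C} \Pr[x]\,|x| - \int_{x < -C} \Pr[x]\, C \\
&\le \int_{x < -C} \Pr[x]\, x^2/C - \int_{x < -C} \Pr[x]\, C \\
&= \int_{x < -C} \Pr[x]\, \frac{x^2 - C^2}{C} \\
&\le \frac{1}{C} \int_{x < -C} \Pr[x]\,(x^2 - \mathbf{E}[X]^2) && \text{since } |\mathbf{E}[X]| \le C \\
&\le \frac{\mathrm{Var}[X]}{C},
\end{aligned}
\]

and similarly $\mathbf{E}[\bar{X}] - \mathbf{E}[X] \ge -\mathrm{Var}[X]/C$, and the result follows. □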
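The effect of clipping on the mean can be computed exactly for a small discrete distribution. The example below is a sketch with a hypothetical zero-mean, skewed distribution and hypothetical helper names. It returns the clipping bias together with the quantity $\mathrm{Var}[X]/C$ from the lemma.

```python
def clip(x, c):
    return min(c, max(-c, x))

def expectation(dist, f=lambda x: x):
    """dist maps each value to its probability."""
    return sum(p * f(x) for x, p in dist.items())

def clipping_bias_and_bound(dist, c):
    """Return (|E[clip(X, c)] - E[X]|, Var[X] / c)."""
    mean = expectation(dist)
    variance = expectation(dist, lambda x: (x - mean) ** 2)
    bias = abs(expectation(dist, lambda x: clip(x, c)) - mean)
    return bias, variance / c

# Hypothetical distribution: X = -1 w.p. 0.9 and X = 9 w.p. 0.1, so E[X] = 0 and Var[X] = 9.
bias, bound = clipping_bias_and_bound({-1.0: 0.9, 9.0: 0.1}, c=3.0)
```

Here clipping at $C = 3$ moves only the rare large value, shifting the mean by 0.6, while $\mathrm{Var}[X]/C = 3$.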

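Lemma B.2. For random variables $X$ and $Y$, and $\alpha \in [0, 1]$,

\[
\mathbf{E}[(\alpha X + (1 - \alpha)Y)^2] \le \max\{\mathbf{E}[X^2], \mathbf{E}[Y^2]\}.
\]

This implies by induction that the second moment of a convex combination of random variables is no more than the maximum of their second moments.

Proof. We have, using Cauchy-Schwarz for the first inequality,

\[
\begin{aligned}
\mathbf{E}[(\alpha X + (1 - \alpha)Y)^2] &= \alpha^2 \mathbf{E}[X^2] + 2\alpha(1 - \alpha)\mathbf{E}[XY] + (1 - \alpha)^2 \mathbf{E}[Y^2] \\
&\le \alpha^2 \mathbf{E}[X^2] + 2\alpha(1 - \alpha)\sqrt{\mathbf{E}[X^2]\,\mathbf{E}[Y^2]} + (1 - \alpha)^2 \mathbf{E}[Y^2] \\
&= \left(\alpha\sqrt{\mathbf{E}[X^2]} + (1 - \alpha)\sqrt{\mathbf{E}[Y^2]}\right)^2 \\
&\le \max\left\{\sqrt{\mathbf{E}[X^2]}, \sqrt{\mathbf{E}[Y^2]}\right\}^2 \\
&= \max\{\mathbf{E}[X^2], \mathbf{E}[Y^2]\}.
\end{aligned}
\]

□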



B.1 Martingale and concentration lemmas

The Bernstein inequality, that holds for random variables $Z_t$, $t \in [T]$, that are independent, and such that for all $t$, $\mathbf{E}[Z_t] = 0$, $\mathbf{E}[Z_t^2] \le s$, and $|Z_t| \le V$, states

\[
\log \mathrm{Prob}\Big\{\sum_{t \in [T]} Z_t \ge a\Big\} \le -a^2 / 2(Ts + aV/3). \tag{32}
\]

Here we need a similar bound for random variables which are not independent, but form a martingale with respect to a certain filtration. Many concentration results have been proven for martingales, including somewhere, in all likelihood, the present lemma. However, for clarity and completeness, we will outline how the proof of the Bernstein inequality can be adapted to this setting.

Lemma B.3. Let $\{Z_t\}$ be a martingale difference sequence with respect to filtration $\{S_t\}$, such that $\mathbf{E}[Z_t \mid S_1, \ldots, S_t] = 0$. Assume the filtration $\{S_t\}$ is such that the values in $S_t$ are determined using only those in $S_{t-1}$, and not any previous history, and so the joint probability distribution

\[
\mathrm{Prob}\{S_1 = s_1, S_2 = s_2, \ldots, S_T = s_T\} = \prod_{t \in [T-1]} \mathrm{Prob}\{S_{t+1} = s_{t+1} \mid S_t = s_t\}.
\]

In addition, assume for all $t$, $\mathbf{E}[Z_t^2 \mid S_1, \ldots, S_t] \le s$, and $|Z_t| \le V$. Then

\[
\log \mathrm{Prob}\Big\{\sum_{t \in [T]} Z_t \ge a\Big\} \le -a^2 / 2(Ts + aV/3).
\]
teT 

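Proof. A key step in proving the Bernstein inequality is to show an upper bound on the exponential generating function $\mathbf{E}[\exp(\lambda Z)]$, where $Z = \sum_t Z_t$, and $\lambda > 0$ is a parameter to be chosen. This step is where the hypothesis of independence is applied. In our setting, we can show a similar upper bound on this expectation: Let $\mathbf{E}_t[\cdot]$ denote expectation with respect to $S_t$, and $\mathbf{E}_{[T]}$ denote expectation with respect to $S_t$ for $t \in [T]$. This expression for the probability distribution implies that for any real-valued function $f$ of state tuples $S_t$,

\[
\begin{aligned}
\mathbf{E}_{[T]}\Big[\prod_{t \in [T]} f(S_t)\Big]
&= f(s_1) \int_{s_2, \ldots, s_T} \Big[\prod_{t \in [T-1]} f(s_{t+1})\Big] \Big[\prod_{t \in [T-1]} \mathrm{Prob}\{S_{t+1} = s_{t+1} \mid S_t = s_t\}\Big] \\
&= f(s_1) \int_{s_2, \ldots, s_{T-1}} \Big[\prod_{t \in [T-2]} f(s_{t+1})\Big] \Big[\prod_{t \in [T-2]} \mathrm{Prob}\{S_{t+1} = s_{t+1} \mid S_t = s_t\}\Big] \int_{s_T} f(s_T)\, \mathrm{Prob}\{S_T = s_T \mid S_{T-1} = s_{T-1}\},
\end{aligned}
\]

where the inner integral can be denoted as the conditional expectation $\mathbf{E}_T[f(S_T) \mid S_{T-1}]$. By induction this is

\[
f(s_1) \int_{s_2} f(s_2)\, \mathrm{Prob}\{S_2 = s_2 \mid S_1 = s_1\} \Big[ \int_{s_3} \cdots \int_{s_T} f(s_T)\, \mathrm{Prob}\{S_T = s_T \mid S_{T-1} = s_{T-1}\} \Big],
\]

and by writing the constant $f(S_1)$ as the expectation with respect to the constant $S_0 = s_0$, and using $\mathbf{E}_X[\mathbf{E}_X[Y]] = \mathbf{E}_X[Y]$ for any random variables $X$ and $Y$, we can write this as

\[
\mathbf{E}_{[T]}\Big[\prod_{t \in [T]} f(S_t)\Big] = \mathbf{E}_{[T]}\Big[\prod_{t \in [T]} \mathbf{E}_t[f(S_t) \mid S_{t-1}]\Big].
\]

For fixed $i$ and a given $\lambda \in \mathbb{R}$, we take $f(S_1) = 1$, and $f(S_t) = \exp(\lambda Z_{t-1})$, to obtain

\[
\mathbf{E}_{[T]}\Big[\exp\Big(\lambda \sum_{t \in [T]} Z_t\Big)\Big] = \mathbf{E}_{[T]}\Big[\prod_{t \in [T]} \mathbf{E}_t[\exp(\lambda Z_t) \mid S_{t-1}]\Big].
\]

Now for any random variable $X$ with $\mathbf{E}[X] = 0$, $\mathbf{E}[X^2] \le s$, and $|X| \le V$,

\[
\mathbf{E}[\exp(\lambda X)] \le \exp\left(\frac{s}{V^2}\left(e^{\lambda V} - 1 - \lambda V\right)\right),
\]

(as is shown and used for proving Bernstein's inequality in the independent case) and therefore

\[
\mathbf{E}_{[T]}[\exp(\lambda Z)] \le \mathbf{E}_{[T]}\Big[\prod_{t \in [T]} \exp\left(\frac{s}{V^2}\left(e^{\lambda V} - 1 - \lambda V\right)\right)\Big] = \exp\left(T\frac{s}{V^2}\left(e^{\lambda V} - 1 - \lambda V\right)\right),
\]

where $Z = \sum_{t \in [T]} Z_t$. This bound is the same as is obtained for independent $Z_t$, and so the remainder of the proof is exactly as in the proof for the independent case: Markov's inequality is applied to the random variable $\exp(\lambda Z)$, obtaining

\[
\mathrm{Prob}\{Z \ge a\} \le \exp(-\lambda a)\, \mathbf{E}_{[T]}[\exp(\lambda Z)] \le \exp\left(-\lambda a + T\frac{s}{V^2}\left(e^{\lambda V} - 1 - \lambda V\right)\right),
\]

and an appropriate value $\lambda = \frac{1}{V}\log(1 + aV/Ts)$ is chosen for minimizing the bound, yielding

\[
\mathrm{Prob}\{Z \ge a\} \le \exp\left(-\frac{Ts}{V^2}\left((1 + \gamma)\log(1 + \gamma) - \gamma\right)\right),
\]

where $\gamma = aV/Ts$, and finally the inequality for $\gamma \ge 0$ that $(1 + \gamma)\log(1 + \gamma) - \gamma \ge \frac{\gamma^2/2}{1 + \gamma/3}$ is applied. □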
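The final bound can be compared with an exact tail in the simplest independent case. The sketch below uses sums of $T$ independent uniform $\pm 1$ signs, for which $s = 1$ and $V = 1$; this choice of example and the function names are assumptions made for illustration.

```python
from math import comb, exp

def bernstein_bound(a, T, s, V):
    """Right-hand side of the Bernstein inequality, as a probability."""
    return exp(-a * a / (2 * (T * s + a * V / 3)))

def rademacher_tail(T, a):
    """Exact Prob{Z >= a} where Z is a sum of T independent uniform +-1 signs."""
    # With k signs equal to +1 the sum is 2k - T.
    favorable = sum(comb(T, k) for k in range(T + 1) if 2 * k - T >= a)
    return favorable / 2 ** T

# Hypothetical parameters: T = 100 steps, deviation a = 20 (two standard deviations).
exact = rademacher_tail(100, 20)
bound = bernstein_bound(20, T=100, s=1.0, V=1.0)
```

The same right-hand side applies to the martingale setting of Lemma B.3, with $s$ and $V$ bounding the conditional second moment and the magnitude of each difference.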

B.2 Proof of lemmas used in main theorem

We restate and prove Lemmas 2.4, 2.5 and 2.6, in slightly more general form. In the following we only assume that $v_t(i) = \mathrm{clip}(\tilde{v}_t(i), \frac{1}{\eta})$ is the clipping of a random variable $\tilde{v}_t(i)$. The variance of $\tilde{v}_t(i)$ is at most one, $\mathrm{Var}[\tilde{v}_t(i)] \le 1$, and we denote by $\mu_t(i) = \mathbf{E}[\tilde{v}_t(i)]$. We also assume that the expectations of $\tilde{v}_t(i)$ are bounded by an absolute constant $|\mu_t(i)| \le C \le \frac{1}{\eta}$. This constant is one for the perceptron application, but at most two for MEB. Note that since the variance of $\tilde{v}_t(i)$ is bounded by one, so is the variance of its clipping $v_t(i)$. (This follows from the fact that the second moment only decreases by the clipping operation, and the definition of variance as $\mathrm{Var}(v_t(i)) = \min_z \mathbf{E}[(v_t(i) - z)^2]$. We can use $z = \mathbf{E}[\tilde{v}_t(i)]$, and hence the decrease in second moment suffices.)

Lemma B.4. For $\eta \ge \sqrt{\frac{\log n}{10T}}$, with probability at least $1 - O(1/n)$,

\[
\max_i \Big| \sum_{t \in [T]} [v_t(i) - \mu_t(i)] \Big| \le 90\eta T.
\]

Proof. Lemma B.1 implies that $|\mathbf{E}[v_t(i)] - \mu_t(i)| \le \eta$, since $\mathrm{Var}[\tilde{v}_t(i)] \le 1$.

We show that for given $i \in [n]$, with probability $1 - O(1/n^2)$, $\sum_{t \in [T]} [v_t(i) - \mathbf{E}[v_t(i)]] \le 80\eta T$, and then apply the union bound over all $i \in [n]$. This together with the above bound on $|\mathbf{E}[v_t(i)] - \mu_t(i)|$ implies the lemma via the triangle inequality.

Fixing $i$, let $Z_t^i = v_t(i) - \mathbf{E}[v_t(i)]$, and consider the filtration given by

\[
S_t = (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, v_{t-1} - \mathbf{E}[v_{t-1}]).
\]

Using the notation $\mathbf{E}_t[\cdot] = \mathbf{E}[\cdot \mid S_t]$, observe that

1. $\forall t$: $\mathbf{E}_t[(Z_t^i)^2] = \mathbf{E}_t[v_t(i)^2] - \mathbf{E}_t[v_t(i)]^2 = \mathrm{Var}(v_t(i)) \le 1$.

2. $|Z_t^i| \le 2/\eta$. This holds since by construction, $|v_t(i)| \le 1/\eta$, and hence

\[
|Z_t^i| = |v_t(i) - \mathbf{E}[v_t(i)]| \le |v_t(i)| + |\mathbf{E}[v_t(i)]| \le \frac{2}{\eta}.
\]

Using these conditions, despite the fact that the $Z_t^i$ are not independent, we can use Lemma B.3 and conclude that $Z = \sum_{t \in [T]} Z_t^i$ satisfies the Bernstein-type inequality with $s = 1$ and $V = 2/\eta$:

\[
\log \mathrm{Prob}\{Z \ge a\} \le -a^2 / 2(Ts + aV/3) \le -a^2 / 2(T + 2a/3\eta).
\]

Letting $a \leftarrow 80\eta T$, we have

\[
\log \mathrm{Prob}\{Z \ge 80\eta T\} \le -a^2 / 2(T + 2a/3\eta) \le -20\eta^2 T.
\]

For $\eta \ge \sqrt{\frac{\log n}{10T}}$, the above probability is at most $e^{-2\log n} \le \frac{1}{n^2}$. □



Lemma 2.5 can be restated in the following more general form:

Lemma B.5. For $\eta \ge \sqrt{\frac{\log n}{5T}}$, with probability at least $1 - O(1/n)$, it holds that

\[
\Big| \sum_{t \in [T]} \mu_t(i_t) - \sum_t p_t^T v_t \Big| \le 100C\eta T.
\]

It is a corollary of the following two lemmas:

Lemma B.6. For $\eta \ge \sqrt{\frac{\log n}{10T}}$, with probability at least $1 - O(1/n)$,

\[
\Big| \sum_{t \in [T]} p_t^T v_t - \sum_t p_t^T \mu_t \Big| \le 90\eta T.
\]

Proof. This Lemma is proven in essentially the same manner as Lemma 2.4, and the proof is given below for completeness.

Lemma B.1 implies that $|\mathbf{E}[v_t(i)] - \mu_t(i)| \le \eta$, using $\mathrm{Var}[\tilde{v}_t(i)] \le 1$. Since $p_t$ is a distribution, it follows that $|\mathbf{E}[p_t^T v_t] - p_t^T \mu_t| \le \eta$.

Let $Z_t = p_t^T v_t - \mathbf{E}[p_t^T v_t] = \sum_i p_t(i) Z_t^i$, where $Z_t^i = v_t(i) - \mathbf{E}[v_t(i)]$. Consider the filtration given by

\[
S_t = (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, v_{t-1} - \mathbf{E}[v_{t-1}]).
\]

Using the notation $\mathbf{E}_t[\cdot] = \mathbf{E}[\cdot \mid S_t]$, the quantities $|Z_t|$ and $\mathbf{E}_t[Z_t^2]$ can be bounded as follows:

\[
|Z_t| = \Big| \sum_i p_t(i) Z_t^i \Big| \le \sum_i p_t(i) |Z_t^i| \le 2\eta^{-1}, \quad \text{using } |Z_t^i| \le 2\eta^{-1} \text{ as in Lemma 2.4}.
\]

Also, using properties of variance, we have

\[
\mathbf{E}[Z_t^2] = \mathrm{Var}[p_t^T v_t] = \sum_i p_t(i)^2\, \mathrm{Var}(v_t(i)) \le \max_i \mathrm{Var}[v_t(i)] \le 1.
\]

We can now apply the Bernstein-type inequality of Lemma B.3, and continue exactly as in Lemma 2.4. □
Lemma B.7. For $\eta \ge \sqrt{\frac{\log n}{5T}}$, with probability at least $1 - O(1/n)$,

\[
\Big| \sum_{t \in [T]} \mu_t(i_t) - \sum_t p_t^T \mu_t \Big| \le 10C\eta T.
\]

Proof. Let $Z_t = \mu_t(i_t) - p_t^T \mu_t$, where now $\mu_t$ is a constant vector and $i_t$ is the random variable, and consider the filtration given by

\[
S_t = (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, Z_{t-1}).
\]

The expectation of $\mu_t(i_t)$, conditioning on $S_t$ with respect to the random choice $r(i_t)$, is $p_t^T \mu_t$. Hence $\mathbf{E}_t[Z_t] = 0$, where $\mathbf{E}_t[\cdot]$ denotes $\mathbf{E}[\cdot \mid S_t]$. The parameters $|Z_t|$ and $\mathbf{E}[Z_t^2]$ can be bounded as follows:

\[
|Z_t| \le |\mu_t(i)| + |p_t^T \mu_t| \le 2C,
\]

\[
\mathbf{E}[Z_t^2] = \mathbf{E}[(\mu_t(i) - p_t^T \mu_t)^2] \le 2\mathbf{E}[\mu_t(i)^2] + 2(p_t^T \mu_t)^2 \le 4C^2.
\]

Applying Lemma B.3 to $Z = \sum_{t \in [T]} Z_t$, with parameters $s \le 4C^2$, $V \le 2C$, we obtain

\[
\log \mathrm{Prob}\{Z \ge a\} \le -a^2 / (4C^2 T + 2Ca).
\]

Letting $a \leftarrow 10C\eta T$, we obtain

\[
\log \mathrm{Prob}\{Z \ge 10C\eta T\} \le -\frac{100\eta^2 C^2 T^2}{4C^2 T + 20C^2 \eta T} \le -5\eta^2 T \le -\log n,
\]

where the last inequality holds assuming $\eta \ge \sqrt{\frac{\log n}{5T}}$. □



Finally, we prove Lemma 2.6 by a simple application of Markov's inequality:

Lemma B.8. With probability at least $1 - \frac{1}{4}$ it holds that $\sum_t p_t^T v_t^2 \le 8C^2 T$.

Proof. By assumption, $\mathbf{E}[\tilde{v}_t^2(i)] \le C^2$, and using Lemma B.1 we have $\mathbf{E}[v_t(i)^2] \le (C + \eta)^2 \le 2C^2$. By linearity of expectation, we have $\mathbf{E}[\sum_t p_t^T v_t^2] \le 2C^2 T$, and since the random variables $v_t^2$ are non-negative, applying Markov's inequality yields the lemma. □



C Bounded precision 

All algorithms in this paper can be implemented with bounded precision. 

First we observe that approximation of both the training data and the vectors that are "played" 
does not increase the regret too much, for both settings we are working in. 

Lemma C.1. Given a sequence of functions $f_1, \ldots, f_T$ and another sequence $\tilde{f}_1, \ldots, \tilde{f}_T$ all mapping $\mathbb{R}^d$ to $\mathbb{R}$, such that $|\tilde{f}_t(x) - f_t(x)| \le \alpha_f$ for all $x \in \mathbb{B}$ and $t \in [T]$, suppose $x_1, \ldots, x_T \in \mathbb{B}$ is a sequence of regret $R$ against $\{\tilde{f}_t\}$, that is,

\[
\max_{x \in \mathbb{B}} \sum_{t \in [T]} \tilde{f}_t(x) - \sum_{t \in [T]} \tilde{f}_t(x_t) \le R.
\]

Now suppose $\tilde{x}_1, \ldots, \tilde{x}_T \in \mathbb{R}^d$ is a sequence with $|f_t(\tilde{x}_t) - f_t(x_t)| \le \alpha_x$ for all $t \in [T]$. Then

\[
\max_{x \in \mathbb{B}} \sum_{t \in [T]} f_t(x) - \sum_{t \in [T]} f_t(\tilde{x}_t) \le R + T(\alpha_x + 2\alpha_f).
\]

Proof. For $x \in \mathbb{B}$, we have $\sum_{t \in [T]} f_t(x) \le \sum_{t \in [T]} \tilde{f}_t(x) + T\alpha_f$, and

\[
\sum_{t \in [T]} f_t(\tilde{x}_t) \ge \sum_{t \in [T]} f_t(x_t) - T\alpha_x \ge \sum_{t \in [T]} \tilde{f}_t(x_t) - T\alpha_x - T\alpha_f,
\]

and the result follows by combining these inequalities. □

That is, $x_t$ is some sequence known to have small regret against the "training functions" $\tilde{f}_t(x)$, which are approximations to the true functions of interest, and the $\tilde{x}_t$ are approximations to these $x_t$. The lemma says that despite these approximations, the $\tilde{x}_t$ sequence has controllable regret against the true functions.

This lemma is stated in more generality than we need: all functions considered here have the form $f_t(x) = b_t + q_t^T x + \gamma\|x\|^2$, where $|b_t| \le 1$, $q_t \in \mathbb{B}$, and $|\gamma| \le 1$. Thus if $\tilde{f}_t(x) = \tilde{b}_t + \tilde{q}_t^T x + \gamma\|x\|^2$, then the first condition $|\tilde{f}_t(x) - f_t(x)| \le \alpha_f$ holds when $|\tilde{b}_t - b_t| + \|\tilde{q}_t - q_t\| \le \alpha_f$. Also, the second condition $|f_t(\tilde{x}_t) - f_t(x_t)| \le \alpha_x$ holds for such functions when $\|\tilde{x}_t - x_t\| \le \alpha_x/3$.

Lemma C.2. Given a sequence of vectors $q_1, \ldots, q_T \in \mathbb{R}^n$, with $\|q_t\|_\infty \le B$ for $t \in [T]$, and a sequence $\tilde{q}_1, \ldots, \tilde{q}_T \in \mathbb{R}^n$ such that $\|\tilde{q}_t - q_t\|_\infty \le \alpha_q$ for all $t \in [T]$, suppose $p_1, \ldots, p_T \in \Delta$ is a sequence of regret $R$ against $\{\tilde{q}_t\}$, that is,

\[
\sum_{t \in [T]} p_t^T \tilde{q}_t - \min_{p \in \Delta} \sum_{t \in [T]} p^T \tilde{q}_t \le R.
\]

Now suppose $\tilde{p}_1, \ldots, \tilde{p}_T \in \mathbb{R}^n$ is a sequence with $\|\tilde{p}_t - p_t\|_1 \le \alpha_p$ for all $t \in [T]$. Then

\[
\sum_{t \in [T]} \tilde{p}_t^T q_t - \min_{p \in \Delta} \sum_{t \in [T]} p^T q_t \le R + T(B\alpha_p + 2\alpha_q).
\]

Proof. For $p \in \Delta$ we have $\sum_{t \in [T]} p^T \tilde{q}_t \le \sum_{t \in [T]} p^T q_t + T\alpha_q$, and

\[
\sum_{t \in [T]} \tilde{p}_t^T q_t \le \sum_{t \in [T]} p_t^T q_t + TB\alpha_p \le \sum_{t \in [T]} p_t^T \tilde{q}_t + TB\alpha_p + T\alpha_q.
\]

The proof follows by combining the inequalities. □

Note that to have $\|\tilde{p}_t - p_t\|_1 \le \alpha_p$, it is enough that the relative error of each entry of $\tilde{p}_t$ is $\alpha_p$.

The use of $\tilde{q}_t$ in place of $q_t$ (for either of the two lemmas) will be helpful for our semi-streaming and kernelized algorithms (§5, §6), where computation of the norms $\|y_t\|$ of the working vectors $y_t$ is a bottleneck; the above two lemmas imply that it is enough to compute such norms to within relative error $\epsilon$ or so.

C.l Bit Precision for Algorithm [l] 

First, the bit precision needed for the OGD part of the algorithm. Let $\gamma$ denote a sufficiently
small constant fraction of $\epsilon$, where the small constant is absolute. From Lemma C.1 and the following
discussion, we need only use the rows $A_i$ up to a precision that gives an approximation $\tilde A_i$ that
is within Euclidean distance $\gamma$, and similarly for an approximation $\tilde x_t$ of $x_t$. For the latter, in
particular, we need only compute $\|y_t\|$ to within relative error $\gamma$. Thus a per-entry precision of
$\gamma/\sqrt{d}$ is sufficient.
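
The step from per-entry precision to Euclidean distance is just the observation that d entries, each off by at most gamma/sqrt(d), give a vector off by at most gamma. A minimal sketch follows, assuming a simple round-to-grid quantizer; the helper names and the choice of quantizer are illustrative and do not come from the algorithm itself.

```python
import math
import random

def quantize(vec, step):
    """Round each entry to the nearest multiple of `step` (error at most step/2 per entry)."""
    return [round(v / step) * step for v in vec]

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def quantization_demo(d=64, gamma=1e-3, seed=0):
    """Quantize a random unit vector and return (distance to original, gamma)."""
    rng = random.Random(seed)
    raw = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in raw))
    row = [v / norm for v in raw]          # stands in for a row A_i in the unit ball
    step = 2 * gamma / math.sqrt(d)        # grid spacing, so per-entry error <= gamma/sqrt(d)
    approx = quantize(row, step)
    return euclidean_distance(row, approx), gamma
```

The returned distance is at most sqrt(d * (gamma/sqrt(d))^2) = gamma, which is the Euclidean accuracy asked for above.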
We need $\|x_t\|$ for $\ell_2$ sampling; arithmetic relative error $\gamma/\sqrt{d}$ in the sampling procedure gives
an estimate $\tilde v_t(i)$ of $v_t(i)$ for which $\mathbb{E}[\tilde v_t(i)] = A_i \hat x_t$, where $\hat x_t$ is a vector within $O(\gamma)$ Euclidean distance
of $x_t$. We can thus charge this error to the OGD analysis, where $\hat x_t$ is the $\tilde x_t$ of Lemma C.1.

For the MW part of the algorithm, we observe that due to the clipping step, if the initial
computation of $v_t(i)$, Line 9, is done with $\eta\epsilon/5$ relative error, then the computed value is within $\epsilon/5$
additive error. Similar precision for the clipping implies that the computed value of $v_t(i)$, which
takes the place of $\tilde q_t$ in Lemma C.2, is within $\epsilon/5$ of the exact version, corresponding to $q_t$ in the
lemma. Here $B$ of the lemma, bounding $\|q_t\|_\infty$, is $1/\eta$, due to the clipping.



It remains to determine the arithmetic relative error needed in the update step, Line 11, to
keep the relative error of the computed value of $p_t$, or $\tilde p_t$ of Lemma C.2, small enough. Indeed, if
the relative error is a small enough constant fraction of $\eta\epsilon/T$, then the relative error of all updates
together can be $\eta\epsilon/3$. Thus $\alpha_p \le \eta\epsilon/3$ and $\alpha_q \le \epsilon/3$, and the added regret due to arithmetic error
is at most $T\epsilon$.
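
Both claims in this paragraph can be made explicit with a short calculation. The constant $c$ below is a hypothetical stand-in for the "small enough constant fraction", and the accumulation step is a rough sketch that treats the $T$ relative errors as compounding multiplicatively, with $c\,\eta\epsilon \le 1$:

$$
\Bigl(1 + \frac{c\,\eta\epsilon}{T}\Bigr)^{T} - 1 \;\le\; e^{c\,\eta\epsilon} - 1 \;\le\; 2c\,\eta\epsilon ,
$$

which is at most $\eta\epsilon/3$ once $c \le 1/6$. Substituting $B = 1/\eta$, $\alpha_p \le \eta\epsilon/3$ and $\alpha_q \le \epsilon/3$ into the bound of Lemma C.2 then gives

$$
T(B\alpha_p + 2\alpha_q) \;\le\; T\Bigl(\frac{1}{\eta}\cdot\frac{\eta\epsilon}{3} + 2\cdot\frac{\epsilon}{3}\Bigr) \;=\; T\epsilon .
$$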

Summing up: the arithmetic precision needed is at most on the order of

$$-\log \min\{\epsilon/\sqrt{d},\ \eta\epsilon,\ \eta\epsilon/T\} = O(\log(nd/\epsilon))$$

to obtain a solution with additive $T\epsilon/10$ regret over the solution computed using exact computation.
This implies an additional error of $\epsilon/10$ to the computed solution, and thus changes only constant
factors in the algorithm.
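
To get a feel for the magnitudes involved, the expression above can be evaluated for concrete parameters. This is a small sketch only: the function name is hypothetical, the values of eta and T are supplied by the caller rather than derived from the algorithm, and the absolute constants hidden in "a sufficiently small constant fraction" are ignored.

```python
import math

def precision_bits(eps, eta, T, d):
    """Smallest integer b with 2**-b <= min{eps/sqrt(d), eta*eps, eta*eps/T}.

    Constant factors from the analysis are ignored; this only illustrates
    the order of magnitude of the bit precision.
    """
    target = min(eps / math.sqrt(d), eta * eps, eta * eps / T)
    return math.ceil(-math.log2(target))
```

For example, `precision_bits(0.1, 0.01, 1000, 100)` takes the minimum of 0.01, 0.001 and roughly 1e-6, so about 20 bits are enough in that illustrative setting.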

C.2 Bit Precision for Convex Quadratic Programming 

From the remarks following Lemma C.1, the conditions of that lemma hold in the setting of convex
quadratic programming in the simplex, assuming that every $A_i \in \mathbb{B}$. Thus the discussion of §C.1
carries over, up to constants, with the simplification that computation of $\|y_t\|$ is not needed.